
How to create a custom normal texture encoding function.
When creating a mobile game, you encounter a number of problems.
Besides draw calls, one of the most common CPU bottlenecks, frequent texture fetching is another.
Even experienced artists have heard about this to some extent, so when optimizing they try to create as few samplers as possible.
However, the root cause is not well understood.
I am not going to discuss the hardware mechanisms here.
Anyone who already understands them is probably an expert who does not need to read this post anyway.
For now, just remember that it is best to declare as few samplers as possible.
You have probably seen sampler2D and tex2D many times; each of them involves a texture fetch, and each texture also has to be set by the CPU.
Like draw calls, this work is serialized, so a high number of texture fetches is bad news.
When developing a skin shader or another complex shader, you inevitably need more than one normal map.
Sooner or later you will wonder how to pack two normal maps into one.
The math is unlikely to be of interest, so let's go straight to the main topic and see how it can be implemented.
I used the Amplify Shader Editor to make this easier to work through.
The same approach applies to URP, but for now let's stick to the basics.
First, look at Unity's built-in shader code.
I am referring to ShaderLibrary/Packing.hlsl in URP Core.
// Unpack from normal map
real3 UnpackNormalRGB(real4 packedNormal, real scale = 1.0)
{
    real3 normal;
    normal.xyz = packedNormal.rgb * 2.0 - 1.0;
    normal.xy *= scale;
    return normalize(normal);
}
Let's adapt the code to our purpose and add a new function based on the Unity code above.
The new function will be tested with a Custom Expression node in the Amplify Shader Editor.
The purpose of the new function is clear.
1. Combine two normal maps into one normal map so that only one texture needs to be set.
2. Apply the basic optimization approach for mobile games. For example, be generous about the error introduced by an approximation in order to minimize ALU instructions.
I rewrote the code as two functions.
JP_UnpackNormalRG_SafeNormal
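inline float3 JP_UnpackNormalRG_SafeNormal( half2 normalXY )
{
    half3 normal;
    // Remap XY from [0, 1] to [-1, 1].
    normal.xy = normalXY.xy * 2 - 1;
    // Rebuild Z from the unit-length constraint; saturate keeps the sqrt argument non-negative.
    normal.z = sqrt(1 - saturate(dot(normal.xy, normal.xy)));
    return normalize(normal);
}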
JP_UnpackNormalRG_SafeNormal_Optimal
inline float3 JP_UnpackNormalRG_SafeNormal_Optimal( half2 normalXY )
{
    // Skip the sqrt: set Z to a constant 1 and let normalize() rescale the vector.
    return normalize(half3(normalXY.xy * 2 - 1, 1));
}
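To connect these functions to the first goal, here is a minimal sketch of how two normals might be read from a single texture. It assumes a layout that this post does not prescribe: the X and Y of the first normal in R and G, and the X and Y of the second normal in B and A, with the texture imported as plain linear data rather than as a Unity normal map. The names _PackedNormalMap and JP_SampleTwoNormals are hypothetical.

```hlsl
// Hypothetical texture property: normal A in RG, normal B in BA (assumed layout).
sampler2D _PackedNormalMap;

// Hypothetical helper, not part of Unity or Amplify Shader Editor.
// One texture fetch yields two tangent-space normals.
inline void JP_SampleTwoNormals(float2 uv, out half3 normalA, out half3 normalB)
{
    // Single fetch instead of one fetch per normal map.
    half4 packedNormals = tex2D(_PackedNormalMap, uv);

    // Remap all four channels from [0, 1] to [-1, 1] at once.
    half4 xy = packedNormals * 2 - 1;

    // Rebuild z the same way as JP_UnpackNormalRG_SafeNormal_Optimal:
    // set it to 1 and let normalize() do the rest.
    normalA = normalize(half3(xy.xy, 1));
    normalB = normalize(half3(xy.zw, 1));
}
```

Swapping in the sqrt-based reconstruction from JP_UnpackNormalRG_SafeNormal would give the more accurate variant at the cost of a few extra ALU instructions per normal.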


I used TransformDirection to visually debug the two functions and compared them.
Let's compile the two functions above and compare their ALU instruction counts.
The disassembly shows a difference of three instructions.
Disassembly of JP_UnpackNormalRG_SafeNormal:
// SV_Target 0 xyzw 0 TARGET float xyzw
ps_4_0
dcl_constantbuffer CB0[5], immediateIndexed
dcl_sampler s0, mode_default
dcl_resource_texture2d (float,float,float,float) t0
dcl_input_ps linear v1.xy
dcl_output o0.xyzw
dcl_temps 1
0: mad r0.xy, v1.xyxx, cb0[4].xyxx, cb0[4].zwzz
1: sample r0.xyzw, r0.xyxx, t0.xyzw, s0
2: mad r0.xy, r0.xyxx, l(2.000000, 2.000000, 0.000000, 0.000000), l(-1.000000, -1.000000, 0.000000, 0.000000)
3: dp2 r0.w, r0.xyxx, r0.xyxx
4: min r0.w, r0.w, l(1.000000)
5: add r0.w, -r0.w, l(1.000000)
6: sqrt r0.z, r0.w
7: dp3 r0.w, r0.xyzx, r0.xyzx
8: rsq r0.w, r0.w
9: mad o0.xyz, r0.xyzx, r0.wwww, l(0.000010, 0.000010, 0.000010, 0.000000)
10: mov o0.w, l(1.000000)
11: ret
// Approximately 0 instruction slots used
Disassembly of JP_UnpackNormalRG_SafeNormal_Optimal:
// SV_Target 0 xyzw 0 TARGET float xyzw
ps_4_0
dcl_constantbuffer CB0[5], immediateIndexed
dcl_sampler s0, mode_default
dcl_resource_texture2d (float,float,float,float) t0
dcl_input_ps linear v1.xy
dcl_output o0.xyzw
dcl_temps 1
0: mad r0.xy, v1.xyxx, cb0[4].xyxx, cb0[4].zwzz
1: sample r0.xyzw, r0.xyxx, t0.xyzw, s0
2: mad r0.xy, r0.xyxx, l(2.000000, 2.000000, 0.000000, 0.000000), l(-1.000000, -1.000000, 0.000000, 0.000000)
3: mov r0.z, l(1.000000)
4: dp3 r0.w, r0.xyzx, r0.xyzx
5: rsq r0.w, r0.w
6: mad o0.xyz, r0.xyzx, r0.wwww, l(0.000010, 0.000010, 0.000010, 0.000000)
7: mov o0.w, l(1.000000)
8: ret
// Approximately 0 instruction slots used
In addition, because the two normal maps are fetched and processed together, more instructions are saved as well. Alternatively, when other texture data is needed, you can store it in the normal map's B and A channels.
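As a sketch of that second option, the following assumes a hypothetical layout where R and G hold the normal's X and Y, B holds an occlusion mask, and A holds smoothness. The channel assignment and the names _NormalMaskMap and JP_SampleNormalAndMasks are illustrative only.

```hlsl
// Hypothetical texture property: normal XY in RG, occlusion in B, smoothness in A (assumed layout).
sampler2D _NormalMaskMap;

// Hypothetical helper: one fetch returns a normal and two scalar masks.
inline half3 JP_SampleNormalAndMasks(float2 uv, out half occlusion, out half smoothness)
{
    half4 texel = tex2D(_NormalMaskMap, uv);

    // B and A are used as-is; they stay in the [0, 1] range.
    occlusion = texel.b;
    smoothness = texel.a;

    // RG is remapped to [-1, 1] and z is approximated with a constant.
    return normalize(half3(texel.rg * 2 - 1, 1));
}
```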
In the optimized function, z is simply defined as a constant, so the per-pixel normal calculation is also kept to a minimum.
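The price of that constant is accuracy. With r = length(xy), the sqrt-based version tilts the normal by asin(r) away from the surface, while the constant-z version tilts it by atan(r). The two agree for small r, and the approximation never tilts beyond 45 degrees. For example, r = 0.6 gives roughly 36.9 degrees versus 31.0 degrees. Below is a small sketch for viewing that difference as a color; the helper name JP_DebugNormalError is hypothetical.

```hlsl
// Hypothetical debug helper, not part of Unity or Amplify Shader Editor.
// Returns the per-channel absolute difference between the two reconstructions.
inline half3 JP_DebugNormalError(half2 normalXY)
{
    // Remap XY from [0, 1] to [-1, 1].
    half2 xy = normalXY * 2 - 1;

    // sqrt-based reconstruction, as in JP_UnpackNormalRG_SafeNormal.
    half3 exact = normalize(half3(xy, sqrt(1 - saturate(dot(xy, xy)))));

    // Constant-z reconstruction, as in JP_UnpackNormalRG_SafeNormal_Optimal.
    half3 approximate = normalize(half3(xy, 1));

    // Black means the two agree; brighter means a larger difference.
    return abs(exact - approximate);
}
```

Written to the fragment color, the result stays black where the normal map is flat and brightens where the stored normals tilt steeply, which is where the approximation should be checked against the art.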
Also, because what used to require setting two textures now requires only one, the CPU bottleneck is reduced as well.
