Understanding CPU, GPU, and temperature behavior on the new Nintendo Switch model

Written by JP.Lee

The Nintendo Switch is a compact game console developed by Nintendo.

I think the reason the Nintendo Switch sells so well as a dedicated gaming device is that it is compact and easy to handle like a smartphone, and its Wi-Fi support makes multiplayer gaming easy.

The Nintendo Switch's configuration and specifications are as follows.

You could say the top-end iPhone 11 or the third-generation iPad Pro delivers performance similar to the Nintendo Switch.

Personally, though, I think the Switch comes out slightly ahead because its internal design is focused purely on gaming.

Its system-level optimization for both docked and handheld operation is also excellent.

* CPU : NVIDIA TEGRA X1+ T210B01

* GPU : NVIDIA GM20B

* RAM : LPDDR4X SDRAM 4GB

* Network: 2.4 GHz / 5 GHz, up to 802.11ac; Bluetooth 4.0

Tegra X1 & X1+ (T210 / T210B01)

* Part number: T210 (X1) / T210B01 (X1+)
* CPU: ARM Cortex-A57 MP4 1.9 GHz + ARM Cortex-A53 MP4 1.3 GHz (both)
* GPU: NVIDIA Maxwell GM20B MP256 1 GHz (both)
* Memory: 64-bit dual-channel LPDDR3/LPDDR4 3200 MHz (T210) / LPDDR4/LPDDR4X (T210B01)
* Process: TSMC 20 nm SoC (T210) / 16 nm (T210B01)
* Integrated modem: none on either chip
* Main devices: Nintendo Switch, SHIELD Android TV, Pixel C, Jetson TX1, Jetson Nano (T210) / Nintendo Switch (revised model), Nintendo Switch Lite, SHIELD Android TV 2019 (T210B01)

The GPU was upgraded from the Kepler-based design of earlier Tegra chips to the second-generation Maxwell architecture.

It supports the OpenGL ES 3.1 and OpenGL 4.4 APIs, and its performance is said to sit somewhere between the Maxwell-based laptop GPUs GeForce 830M/840M and the 940M.

Now let's look at an overlay program that displays the Nintendo Switch's CPU, GPU, and temperature status.

At the simplest level, one of the Nintendo Switch's CPU cores is reserved for the OS: while browsing the Switch's shell, cores 0 through 2 stay asleep the entire time, and only core 3 is active in the menus.

Judging from the on-screen readings, the docked Switch's clocks are completely fixed during gameplay: the CPU is locked at 1020 MHz, the GPU at 768 MHz, and the internal memory controller at 1600 MHz.

There is one twist, however: the Nintendo Switch's Boost Mode.

Through per-game optimization, Boost Mode selectively overclocks the CPU to improve loading speed.

For example, when you die in Super Mario Odyssey, the screen fades to black and the game takes you back to the last checkpoint. In Super Mario Odyssey, that load completes faster thanks to Boost Mode: during loading the CPU is temporarily boosted by 75% to 1785 MHz, while the GPU runs at 76.8 MHz, a tenth of its normal level.

Nintendo balances the thermal budget by automatically steering the CPU and GPU clocks in opposite directions.

Many recent games use this technique.

Wolfenstein: Youngblood and CTR: Crash Team Racing use it out of the box, while The Legend of Zelda: Breath of the Wild and Super Mario Odyssey adopted it through patches. Loading time is governed by how fast the CPU can process data, not by the speed of the NAND flash or SD memory card.

When nothing is being drawn or the image is static, there is no need to push the GPU to its maximum; once gameplay starts, the Switch returns to its default clocks. With Boost Mode enabled, getting from the home menu into The Legend of Zelda: Breath of the Wild takes 23 seconds versus 30 seconds without it, almost 30% faster.

The system monitoring overlay also reveals how far games push the Switch's hardware at the operating-system level to deliver overclocked performance.

This is unrelated to Boost Mode. When the Switch's clocks were first documented, the CPU was fixed at 1020 MHz and the GPU at 307.2 MHz, although before launch the GPU had gone as high as 384 MHz. These days, the most demanding case for the Switch is driving the GPU to 460 MHz. But that is only part of the story.

Mortal Kombat 11 is a typical example. Once an arena has loaded, the GPU runs at 460 MHz from the intro screens until gameplay begins. Although this is an exceptionally high clock, it applies only to gameplay; returning to the menu drops it back to 384 MHz. And while Super Mario Odyssey uses the improved GPU clock, some surprisingly big releases do not: Hellblade: Senua's Sacrifice runs with a high dynamic resolution and a low frame rate, so you would expect higher clocks, yet its GPU runs at the standard 384 MHz.

The system monitoring overlay also provides a detailed report on the Nintendo Switch's temperatures. When docked, the games that spin the fan fastest are DOOM and Wolfenstein, and the temperature readings tell the same story.

In a 22°C room, both games quickly push the PCB to 60°C and the Tegra GPU to 55°C, with the fan peaking at 47% of its maximum speed.

Higher fan speeds are possible, but in sustained testing these two titles, together with Luigi's Mansion 3, produced the highest heat readings. Given that these three games turn out to be power hogs, it is no surprise that they run the CPU cores at close to 90% utilization at peak temperatures. Likewise, since the Switch backs off to a safe 60°C rather than approaching the 100°C boiling point of water, there is plenty of headroom to handle an overclock. The biggest problem with overclocking is acoustics: when the CPU and GPU are driven that hard, the fan noise becomes genuinely bothersome.

Raising the clocks to some degree, however, is something Nintendo already planned for.

So there must be a best way to overclock, and in fact there is: a developer mode that raises the Tegra X1 CPU roughly 20% above its standard clock. In our tests, homebrew overclocking tools did not noticeably drain the battery either, and they eased performance problems in many games.

The system overlay shows that Super Smash Bros. Ultimate, DOOM, Wolfenstein, and Luigi's Mansion 3 all use more than 90% of the CPU, and any idle headroom helps performance. For example, a quick test of Wolfenstein: Youngblood showed improved overall smoothness right from the first stage.

The Nintendo Switch clearly works hard to tune its performance state, as its dynamic CPU speeds, the load-time-cutting Boost Mode, and the 460 MHz figure seen in handheld mode all demonstrate. There is more headroom on top of that, and on the CPU side it looks especially promising.

Beyond reading how the silicon is intended to be used, the MOD community's tools make it possible to understand in much more depth how the console fine-tunes itself when system components are overclocked or while games are running, and whether Nintendo will keep pushing its performance further.

Broadly, the system monitoring overlay shows how versatile the Switch is, and how far the hardware can evolve by striking a careful balance between fan speed, GPU load, and performance. It is the most approachable view yet of how a current-generation console works, and it leaves you curious about what Nintendo will do next.

Create and apply virtual lighting vectors.

JP.Lee
Summary.

When working on requests related to mobile game rendering, complex lighting is often impossible because of hardware performance. However, artists' demands keep increasing. In particular, there are times when you need to correct information that has already gone through static (baked) lighting. I'm going to create a simple example and explain how to additionally create indirect lighting from the scene's lighting and how to use it.

Content.

1. Simple C# scripting ability is required.
2. Basic shader-coding skills are required. However, for ease of explanation, we will use the Amplify Shader Editor.

Only the direction, brightness, and color of the dominant Directional Light are needed.
Just import the direction vector of the engine's Directional Light. The direction vector is Transform.forward.
There is no need to calculate the direction vector separately or to derive it inside the shader.
Let's think about what we can use as a mask in our rendering system.
Statically authored black-and-white textures may be the most commonly used masks. However, they are not enough to displace an object or to deepen the shading and lighting of the geometry.
I personally use NdotL frequently as a mask.
It is especially useful for effects on dynamic objects. An expression like mad(NdotL, A, B) can control NdotL more flexibly than HalfLambert, which is why I mainly reach for it when I need a mask.
Alternatively, we can use the ShadowMask as a mask, and ambient attenuation can also serve as a mask if needed.
Let’s look at an example to get an idea.
The lighting component of the scene.
Directional Light ( Dir vector , Color , Intensity , Attenuations… so on )
|_ _ Sub Light (Dir vector , Intensity , Color is Option )

I’m going to make something like this.

The sub light was added as a child node of the Directional Light. If you added the component directly to the Directional Light itself, the code would have to be more complicated.
The artist will be able to easily adjust the direction of the sub light.

The completed script (component) exposes Color and Intensity. However, this example does not use those values directly.
First, let’s look at the simple implementation code below.

using UnityEngine;

#if UNITY_EDITOR
    using UnityEditor;
#endif

using System;
using System.Collections;
using System.Collections.Generic;

[ExecuteInEditMode]
[DisallowMultipleComponent]
public class XRPVirtualDirectLight : MonoBehaviour
{
    Light lightGameOject;
    
    public Color color_ambient = Color.white;
    public float intensity = 0.5f;
    public float gizmoSize = 0.25f;
    Vector3 LightingForwardVector;


#if UNITY_EDITOR
    void OnValidate()
    {
        
    }
    
    void Start()
    {
        lightGameOject = gameObject.GetComponentInParent(typeof(Light))as Light;
    }

    void Update()
    {
        if (EditorApplication.isPlayingOrWillChangePlaymode) return;
        LightingForwardVector = this.transform.forward;
        Shader.SetGlobalVector("_LightingForwardVector", -LightingForwardVector);

    }

    void OnDrawGizmos()
    {
        DrawGizmosAll();
    }

    void DrawGizmosAll()
    {
        
        Gizmos.color = Color.red;
        var endPoint = transform.position + transform.forward * 2;
        Gizmos.DrawLine(transform.position, endPoint);
        Gizmos.DrawWireSphere(transform.position, gizmoSize);
        Gizmos.DrawLine(endPoint, endPoint + (transform.position + transform.right - endPoint).normalized * 0.5f);
        Gizmos.DrawLine(endPoint, endPoint + (transform.position - transform.right - endPoint).normalized * 0.5f);
        Gizmos.DrawLine(endPoint, endPoint + (transform.position + transform.up - endPoint).normalized * 0.5f);
        Gizmos.DrawLine(endPoint, endPoint + (transform.position - transform.up - endPoint).normalized * 0.5f);

        // The yellow gizmo shows the parent Directional Light's direction; skip it if no parent Light was found.
        if (lightGameOject == null) return;
        Gizmos.color = Color.yellow;
        var endPointDirLight = lightGameOject.transform.position + lightGameOject.transform.forward * 3;
        Gizmos.DrawLine(endPointDirLight, endPointDirLight + (lightGameOject.transform.position + lightGameOject.transform.right - endPointDirLight).normalized * 0.5f);
        Gizmos.DrawLine(endPointDirLight, endPointDirLight + (lightGameOject.transform.position - lightGameOject.transform.right - endPointDirLight).normalized * 0.5f);
        Gizmos.DrawLine(endPointDirLight, endPointDirLight + (lightGameOject.transform.position + lightGameOject.transform.up - endPointDirLight).normalized * 0.5f);
        Gizmos.DrawLine(endPointDirLight, endPointDirLight + (lightGameOject.transform.position - lightGameOject.transform.up - endPointDirLight).normalized * 0.5f);
        Gizmos.DrawLine(lightGameOject.transform.position, endPointDirLight);

    }

#endif
}
void Start()
    {
        lightGameOject = gameObject.GetComponentInParent(typeof(Light))as Light;
    }

This part may or may not be necessary. I fetch the Light component from the sub light's parent node because the Directional Light's values might be needed; in this example I use it for the on-screen gizmo UI. Personally, I find the Directional Light awkward to work with in shaders, because you have to go out of your way to see which direction it is pointing. Artists probably feel the same way.

The yellow line on screen indicates the direction of the Directional Light, and the red line indicates the direction of the sub light. Both are always displayed on screen.
Sub-light direction vector applied to the test shader.

void Update()
    {
        if (EditorApplication.isPlayingOrWillChangePlaymode) return;
        LightingForwardVector = this.transform.forward;
        Shader.SetGlobalVector("_LightingForwardVector", -LightingForwardVector);
    }
Shader.SetGlobalVector("_LightingForwardVector", -LightingForwardVector);

If you look at this part, the vector is negated as -LightingForwardVector. That is because, when lerping inside the shader, I want the white part of the mask to be where the additional effect is applied.

I added it to the shader code to make this part work.

NdotL is wrapped with the mad function.

float MadNdotL13( float madNdotL , float inputA , float inputB )
{
   madNdotL = mad(madNdotL,inputA , inputB);
   return madNdotL;
}

With lerp(A, B, Mask), the white areas of the mask return B.
Let's simply mix the two colors.
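Putting the pieces together, here is a minimal hand-written fragment sketch of the same idea outside the Amplify graph. It assumes a world-space normal called normalWS is available; _SubLightColor, _MadA, and _MadB are illustrative parameters that are not part of the original setup, while _LightingForwardVector is the global set by the component above.

float3 _LightingForwardVector;   // set (already negated) by XRPVirtualDirectLight
float4 _SubLightColor;           // hypothetical tint for the virtual sub light
float  _MadA, _MadB;             // hypothetical remap controls; 0.5 / 0.5 behaves like HalfLambert

float3 ApplyVirtualSubLight(float3 baseColor, float3 normalWS)
{
    // Raw NdotL against the virtual light direction.
    float ndotl = dot(normalize(normalWS), normalize(_LightingForwardVector));

    // mad(NdotL, A, B): a more flexible remap than a fixed HalfLambert.
    float mask = saturate(mad(ndotl, _MadA, _MadB));

    // lerp(A, B, Mask): white areas of the mask receive the tinted result.
    return lerp(baseColor, baseColor * _SubLightColor.rgb, mask);
}

With _MadA = 0.5 and _MadB = 0.5 this collapses back to HalfLambert; other values widen or narrow the area the virtual light affects.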

The shader for simple testing was created with the Amplify Shader Editor.

Debug view for simple understanding.

This can be applied when artists occasionally need an effect handled independently of the main directional light's direction.

For example, it can be used to modify the lighting result within a certain angular range of a specific area, as the blending weight of a ShadowMask, or as a bent direction to modulate Fresnel.

It is certainly not a standard approach, but when you are developing a mobile game there are times when you have to consider it.

Disabling bUseUnityBuild when building the UE4 source code.

<?xml version="1.0" encoding="utf-8" ?>
<Configuration xmlns="https://www.unrealengine.com/BuildConfiguration">
	<BuildConfiguration>
		<bUseUnityBuild>false</bUseUnityBuild>
		<bUsePCHFiles>false</bUsePCHFiles>
	</BuildConfiguration>
</Configuration>

BuildConfiguration Download

bUseUnityBuild
Whether to unify C++ code into larger files for faster compilation.

bForceUnityBuild
Whether to force C++ source files to be combined into larger files for faster compilation.

bUseAdaptiveUnityBuild
Use a heuristic to determine which files are currently being iterated on and exclude them from unity blobs, resulting in faster incremental compile times. The current implementation uses the read-only flag to distinguish the working set, assuming that files will be made writable by the source control system if they are being modified. This is true for Perforce, but not for Git.

bAdaptiveUnityDisablesOptimizations
Disable optimization for files that are in the adaptive non-unity working set.

bAdaptiveUnityDisablesPCH
Disables force-included PCHs for files that are in the adaptive non-unity working set.

bAdaptiveUnityDisablesProjectPCHForProjectPrivate
Backing storage for bAdaptiveUnityDisablesProjectPCH.

bAdaptiveUnityCreatesDedicatedPCH
Creates a dedicated PCH for each source file in the working set, allowing faster iteration on cpp-only changes.

bAdaptiveUnityEnablesEditAndContinue
Creates a dedicated PCH for each source file in the working set, allowing faster iteration on cpp-only changes.

MinGameModuleSourceFilesForUnityBuild
The number of source files in a game module before unity build will be activated for that module. This allows small game modules to have faster iterative compile times for single files, at the expense of slower full rebuild times. This setting can be overridden by the bFasterWithoutUnity option in a module’s Build.cs file.
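As a companion to the global XML settings above, the per-module override mentioned in that last entry looks roughly like this. The module name MyGameModule is hypothetical, and note that later engine versions replace bFasterWithoutUnity with bUseUnity.

// MyGameModule.Build.cs -- hypothetical module, shown only to illustrate the per-module override.
using UnrealBuildTool;

public class MyGameModule : ModuleRules
{
    public MyGameModule(ReadOnlyTargetRules Target) : base(Target)
    {
        PCHUsage = PCHUsageMode.UseExplicitOrSharedPCHs;
        PublicDependencyModuleNames.AddRange(new string[] { "Core", "CoreUObject", "Engine" });

        // Opt this single module out of unity builds, regardless of the global
        // MinGameModuleSourceFilesForUnityBuild threshold.
        bFasterWithoutUnity = true;
    }
}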

Reference.
https://docs.unrealengine.com/en-US/Programming/BuildTools/UnrealBuildTool/BuildConfiguration/index.html

BENT NORMAL?

This post walks through porting the bent-normal-based ambient occlusion and reflection occlusion features provided by Unity's HDRP into a custom URP.

Depending on the project, I think certain parts of HDRP's function library can be reused.

Examples are features like the spherical reflection probe that only works in HDRP.
Even for a mobile game project, hardware has reached the point where HDRP features can be ported and used under the right conditions; as long as you are not depending on ray tracing, the instruction budget simply needs to follow the team's development specification.

What is a bent normal?

What is reflection occlusion?

Let's first build a basic understanding of these two concepts.

First, the bent normal…

It is worth looking at how the Unreal Engine documentation defines it (even though I use Unity…).

https://docs.unrealengine.com/ko/Engine/Rendering/LightingAndShadows/BentNormalMaps/index.html

No BentNormal
BentNormal

Quoting the Unreal Engine documentation, there are two major advantages, listed below.

Advantages of bent normals

Some of the benefits you can expect from using bent normals are as follows.

  • The biggest effect you can expect from bent normals is a reduction in the light leaking that can occur after a lighting build.
  • Bent normals can be used together with ambient occlusion (AO) to improve diffuse indirect lighting. The idea is to use the bent normal instead of the normal for indirect lighting, so that diffuse indirect lighting looks closer to global illumination (GI).

I have not yet checked whether this is applied automatically in Unreal Engine's mobile environments, but either way we are going to port HDRP's version into URP, which is what Android and iOS projects mainly use.

Let's also look at the reference material.
This is the material the Unity engine itself references.

Practical Real-Time Strategies for Accurate Indirect Occlusion (Jorge Jiménez) / SIGGRAPH 2016 COURSE: PHYSICALLY BASED RENDERING.

C:\Users\%username%\AppData\Local\Unity\cache\packages\packages.unity.com\com.unity.render-pipelines.core@7.3.1

I used the Core package version above.
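Before digging into the HDRP source, here is a minimal sketch of the idea being ported, in the spirit of the cone-based approximations from the Jimenez 2016 course cited above. It is not HDRP's actual function: the parameter names are illustrative, and the smoothstep overlap test is a cheap stand-in for a proper spherical-cap intersection.

// Minimal sketch of bent-normal reflection occlusion.
// Idea: treat the AO value as the aperture of a visibility cone around the bent normal,
// then darken reflections whose direction falls outside that cone.
float ReflectionOcclusion(float3 bentNormalWS, float3 reflectionDirWS, float ambientOcclusion, float roughness)
{
    // Cosine of the visibility cone's half-angle, from cosine-weighted cone visibility: AO = 1 - cos^2(angle).
    float visibilityCos = sqrt(saturate(1.0 - ambientOcclusion));

    // How closely the reflection direction follows the average unoccluded (bent) direction.
    float reflCos = dot(normalize(bentNormalWS), normalize(reflectionDirWS));

    // Cheap, smooth overlap test; rougher surfaces have wider specular lobes and fade out more gradually.
    float lobeSoftness = lerp(0.05, 0.5, roughness);
    return smoothstep(visibilityCos - lobeSoftness, visibilityCos + lobeSoftness, reflCos);
}

In the URP port, the returned factor would simply multiply the indirect specular (reflection probe) contribution.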

How to create a custom normal texture encoding function.

When creating a mobile game, you run into a number of problems.

Besides draw calls, one of the most common CPU-side costs, texture fetching also occurs frequently.

When optimizing, even experienced artists have heard about this to some extent, so they try to create as few samplers as possible.

However, the root cause is often not well understood.

In any case, I am not trying to explain the hardware mechanism here.

Anyone who already understands those details is, in my view, experienced enough not to need this article.

Anyway, just keep in mind that it is best to declare as few samplers as possible.

You have probably seen sampler2D or tex2D frequently; each of these requires texture fetching driven from the CPU.

In the end, a high frequency of texture fetches, which are serialized much like draw calls, is not good news.

When developing a skin shader or another somewhat complex shader, you inevitably need more than one normal map.

Eventually, you will wonder how two normal maps can be packed into one.

The mathematics is unlikely to be of much interest, so let's go straight to the main topic and see how it can be implemented.

I also used the Amplify Shader Editor to work through this more easily.

I will cover the same thing in URP later, but for now let's go through this version carefully.
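If you somehow do not know what normal mapping is, or simply want more detail, the YouTube channel of Pope, a former rendering programmer, is worth watching. His video touches on the Y channel of normal maps, and that part deserves a clearer explanation. Going into left-handed versus right-handed coordinate systems would make things too complicated, so to put it simply: DirectX and OpenGL place the UV origin in different positions, and in the end that difference is the culprit.

Now, back to the main topic.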

See Unity3D built-in Shader code.

I referred to ShaderLibrary/Packing.hlsl in the URP Core package.

// Unpack from normal map
real3 UnpackNormalRGB(real4 packedNormal, real scale = 1.0)
{
    real3 normal;
    normal.xyz = packedNormal.rgb * 2.0 - 1.0;
    normal.xy *= scale;
    return normalize(normal);
}

Let's modify the code to fit our purpose and add a new function, referring to the built-in unpack function above. The new function will be tested using a Custom Expression in the Amplify Shader Editor.

What matters is the purpose of this new function:
1. Combine two normal maps into one so that only a single texture bind occurs.
2. Keep the basic rules of mobile optimization in mind; for example, be tolerant of the small deviations introduced by the approximations used to minimize ALU cost.

I reconstructed it into the two functions below.

JP_UnpackNormalRG_SafeNormal

inline float3 JP_UnpackNormalRG_SafeNormal( half2 normalXY )
	{
		half3 normal;
		normal.xy = normalXY.xy * 2 - 1;
		normal.z = sqrt(1 - saturate(dot(normal.xy, normal.xy)));
		return normalize(normal);
	}

JP_UnpackNormalRG_SafeNormal_Optimal

inline float3 JP_UnpackNormalRG_SafeNormal_Optimal( half2 normalXY )
	{
	     return normalize(half3(normalXY.xy * 2 - 1 , 1));
	}

JP_UnpackNormalRG_SafeNormal
JP_UnpackNormalRG_SafeNormal_Optimal

I used TransformDirection to visually debug the two functions and compared them.


Let's compile the two expressions above and compare the instruction counts.

The disassembled code differs by three instructions.

Disassembly of JP_UnpackNormalRG_SafeNormal.

// SV_Target                0   xyzw        0   TARGET   float   xyzw
      ps_4_0
      dcl_constantbuffer CB0[5], immediateIndexed
      dcl_sampler s0, mode_default
      dcl_resource_texture2d (float,float,float,float) t0
      dcl_input_ps linear v1.xy
      dcl_output o0.xyzw
      dcl_temps 1
   0: mad r0.xy, v1.xyxx, cb0[4].xyxx, cb0[4].zwzz
   1: sample r0.xyzw, r0.xyxx, t0.xyzw, s0
   2: mad r0.xy, r0.xyxx, l(2.000000, 2.000000, 0.000000, 0.000000), l(-1.000000, -1.000000, 0.000000, 0.000000)
   3: dp2 r0.w, r0.xyxx, r0.xyxx
   4: min r0.w, r0.w, l(1.000000)
   5: add r0.w, -r0.w, l(1.000000)
   6: sqrt r0.z, r0.w
   7: dp3 r0.w, r0.xyzx, r0.xyzx
   8: rsq r0.w, r0.w
   9: mad o0.xyz, r0.xyzx, r0.wwww, l(0.000010, 0.000010, 0.000010, 0.000000)
  10: mov o0.w, l(1.000000)
  11: ret 
// Approximately 0 instruction slots used

Disassembly of JP_UnpackNormalRG_SafeNormal_Optimal.

// SV_Target                0   xyzw        0   TARGET   float   xyzw
      ps_4_0
      dcl_constantbuffer CB0[5], immediateIndexed
      dcl_sampler s0, mode_default
      dcl_resource_texture2d (float,float,float,float) t0
      dcl_input_ps linear v1.xy
      dcl_output o0.xyzw
      dcl_temps 1
   0: mad r0.xy, v1.xyxx, cb0[4].xyxx, cb0[4].zwzz
   1: sample r0.xyzw, r0.xyxx, t0.xyzw, s0
   2: mad r0.xy, r0.xyxx, l(2.000000, 2.000000, 0.000000, 0.000000), l(-1.000000, -1.000000, 0.000000, 0.000000)
   3: mov r0.z, l(1.000000)
   4: dp3 r0.w, r0.xyzx, r0.xyzx
   5: rsq r0.w, r0.w
   6: mad o0.xyz, r0.xyzx, r0.wwww, l(0.000010, 0.000010, 0.000010, 0.000000)
   7: mov o0.w, l(1.000000)
   8: ret 
// Approximately 0 instruction slots used

In addition, because the two normal maps are gathered and processed in a single pass, further instructions are saved overall. Alternatively, when a different texture set is required, you can store other texture information in the normal map's B and A channels.
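As a concrete illustration of that channel layout, here is a minimal usage sketch. The texture name _PackedNormalTex and its layout (base normal XY in RG, detail normal XY in BA) are assumptions for the example, and the blend at the end is a simple whiteout-style mix.

sampler2D _PackedNormalTex;   // assumed layout: base normal XY in RG, detail normal XY in BA

float3 SampleBlendedNormalTS(float2 uv)
{
    // One texture fetch supplies both normals.
    half4 packedNormals = tex2D(_PackedNormalTex, uv);

    // Reconstruct each normal from its two stored channels.
    float3 baseNormal   = JP_UnpackNormalRG_SafeNormal(packedNormals.rg);
    float3 detailNormal = JP_UnpackNormalRG_SafeNormal_Optimal(packedNormals.ba);

    // Whiteout-style blend of the two tangent-space normals.
    return normalize(float3(baseNormal.xy + detailNormal.xy, baseNormal.z * detailNormal.z));
}

Whether the BA channels hold a second normal or unrelated mask data, the key point is the same: one sampler, one fetch.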

In the optimized function, z is simply defined as a constant, so the per-pixel cost of reconstructing the normal is also minimized.

Also, since two texture binds were replaced with a single one, the CPU-side bottleneck is optimized as well.


Substance Painter Shader Guide Part 3 – Adding Gaussian Blur (blur9)

Summary.

What you practice in this post may be helpful later when implementing a skin shader.
The goal is to understand each piece in depth by implementing it individually first, rather than wiring all the functions together at once.
Again using pbr-metal-roughness.glsl, open the shader code and add the bool variable shown below.

//-------- BLur --------------------------------------------//
//: param custom { "default": false, "label": "Blur" }
uniform bool b_blur;

The b_blur variable works as a pair with the //: param custom { "default": false, "label": "Blur" } declaration.

For the Blur function to be used in this exercise, refer to the link below.

https://github.com/Jam3/glsl-fast-gaussian-blur?fbclid=IwAR3sqv_wuC4TLNHTdcEnI88WylOIIpFZIyYGtTizmN90oqeiwQDaR3wlu9Y

vec4 blur9(sampler2D image, vec2 uv, vec2 resolution, vec2 direction) {
  vec4 color = vec4(0.0);
  vec2 off1 = vec2(1.3846153846) * direction;
  vec2 off2 = vec2(3.2307692308) * direction;
  color += texture2D(image, uv) * 0.2270270270;
  color += texture2D(image, uv + (off1 / resolution)) * 0.3162162162;
  color += texture2D(image, uv - (off1 / resolution)) * 0.3162162162;
  color += texture2D(image, uv + (off2 / resolution)) * 0.0702702703;
  color += texture2D(image, uv - (off2 / resolution)) * 0.0702702703;
  return color;
}

Let's modify the reference code above to fit the Substance Painter API. The relevant internal API documentation is listed below.

1. Sparse API
file:///C:/Program%20Files/Allegorithmic/Substance%20Painter/resources/shader-doc/lib-sparse.html

2. Sampler API
file:///C:/Program%20Files/Allegorithmic/Substance%20Painter/resources/shader-doc/lib-sampler.html

//blur9 function here
vec4 blur9( SamplerSparse image , SparseCoord uv, vec2 resolution, vec2 direction) 
{
  vec4 color = vec4(0.0);
  vec2 off1 = vec2(1.3846153846) * direction;
  vec2 off2 = vec2(3.2307692308) * direction;

  // Center tap.
  color += getBaseColor(image, uv) * 0.2270270270;
  // Tap at +off1.
  uv += (off1 / resolution);
  color += getBaseColor(image, uv) * 0.3162162162;
  // Tap at -off1 (step back past the center).
  uv -= 2.0 * (off1 / resolution);
  color += getBaseColor(image, uv) * 0.3162162162;
  // Tap at +off2.
  uv += (off1 + off2) / resolution;
  color += getBaseColor(image, uv) * 0.0702702703;
  // Tap at -off2.
  uv -= 2.0 * (off2 / resolution);
  color += getBaseColor(image, uv) * 0.0702702703;
  return color;
}

Now that we have the proper function, let’s apply it to Albedo.
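The post stops here, so purely as a rough sketch of that last step (my own illustration, not the original author's code): assuming the base color channel is declared the same way as in the stock shader (the basecolor_tex / channel_basecolor names below are that assumption), the blur toggle could feed the albedo like this. The resolution and blur direction are illustrative constants.

// Rough sketch only: return the blurred base color when b_blur is enabled.
// The channel declaration below assumes the stock shader's naming; adjust to your Painter version.
//: param auto channel_basecolor
uniform SamplerSparse basecolor_tex;

vec3 getAlbedo(SparseCoord coord)
{
  vec2 resolution = vec2(2048.0);    // illustrative document resolution
  vec2 direction  = vec2(1.0, 0.0);  // horizontal pass; use vec2(0.0, 1.0) for a vertical pass

  vec4 blurred = blur9(basecolor_tex, coord, resolution, direction);
  // A zero direction puts every tap on the center texel (the nine weights sum to ~1.0),
  // so blur9 can also be reused for the unblurred sample.
  vec4 sharp   = blur9(basecolor_tex, coord, resolution, vec2(0.0));

  return b_blur ? blurred.rgb : sharp.rgb;
}

Call this where the stock shader fetches its base color, and the Blur checkbox added earlier will switch between the sharp and blurred results.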
