Difference in speed when moving many objects in Unity via CPU code vs a GPU compute shader
I have been testing moving many objects in Unity via plain C# code versus an HLSL compute shader. However, there is no difference in speed; the FPS stays the same. Perlin noise is used to vary the positions: the C# version uses the standard Mathf.PerlinNoise, while the HLSL version uses a custom noise function.
Scenario 1 - via C# code only
Object spawning and updating:
[SerializeField]
private GameObject prefab;

private void Start()
{
    for (int i = 0; i < 50; i++)
        for (int j = 0; j < 50; j++)
        {
            GameObject createdParticle = Instantiate(prefab);
            createdParticle.transform.position = new Vector3(i * 1f, Random.Range(-1f, 1f), j * 1f);
        }
}
Code that moves the objects via C#. This script is attached to every created object:
private Vector3 position = new Vector3();

private void Start()
{
    position = new Vector3(transform.position.x, Mathf.PerlinNoise(Time.time, Time.time), transform.position.z);
}

private void Update()
{
    position.y = Mathf.PerlinNoise(transform.position.x / 20f + Time.time, transform.position.z / 20f + Time.time) * 5f;
    transform.position = position;
}
Scenario 2 - via a compute kernel (GPGPU)
Part 1: C# client code
Spawns the objects, runs the computation on the compute shader, and assigns the resulting values back to the objects:
public struct Particle
{
    public Vector3 position;
    // Must match the HLSL struct layout (float3 position + float4 color = 7 floats),
    // which is why the ComputeBuffer is created with a stride of sizeof(float) * 7.
    public Vector4 color;
}
[SerializeField]
private GameObject prefab;
[SerializeField]
private ComputeShader computeShader;

private List<GameObject> particlesList = new List<GameObject>();
private Particle[] particlesDataArray;

private void Start()
{
    CreateParticles();
}

private void Update()
{
    UpdateParticlePosition();
}
private void CreateParticles()
{
    List<Particle> particlesDataList = new List<Particle>();
    for (int i = 0; i < 50; i++)
        for (int j = 0; j < 50; j++)
        {
            GameObject createdParticle = Instantiate(prefab);
            createdParticle.transform.position = new Vector3(i * 1f, Random.Range(-1f, 1f), j * 1f);
            particlesList.Add(createdParticle);

            Particle particle = new Particle();
            particle.position = createdParticle.transform.position;
            particlesDataList.Add(particle);
        }
    particlesDataArray = particlesDataList.ToArray();
    particlesDataList.Clear();

    computeBuffer = new ComputeBuffer(particlesDataArray.Length, sizeof(float) * 7);
    computeBuffer.SetData(particlesDataArray);
    computeShader.SetBuffer(0, "particles", computeBuffer);
}
private ComputeBuffer computeBuffer;

private void UpdateParticlePosition()
{
    computeShader.SetFloat("time", Time.time);
    computeShader.Dispatch(computeShader.FindKernel("CSMain"), particlesDataArray.Length / 10, 1, 1);
    computeBuffer.GetData(particlesDataArray);

    for (int i = 0; i < particlesDataArray.Length; i++)
    {
        Vector3 pos = particlesList[i].transform.position;
        pos.y = particlesDataArray[i].position.y;
        particlesList[i].transform.position = pos;
    }
}
Part 2: The compute kernel (GPGPU)
#pragma kernel CSMain

struct Particle {
    float3 position;
    float4 color;
};

RWStructuredBuffer<Particle> particles;
float time;

float mod(float x, float y)
{
    return x - y * floor(x / y);
}

float permute(float x) { return floor(mod(((x * 34.0) + 1.0) * x, 289.0)); }
float3 permute(float3 x) { return mod(((x * 34.0) + 1.0) * x, 289.0); }
float4 permute(float4 x) { return mod(((x * 34.0) + 1.0) * x, 289.0); }
float taylorInvSqrt(float r) { return 1.79284291400159 - 0.85373472095314 * r; }
float4 taylorInvSqrt(float4 r) { return float4(taylorInvSqrt(r.x), taylorInvSqrt(r.y), taylorInvSqrt(r.z), taylorInvSqrt(r.w)); }

float3 rand3(float3 c) {
    float j = 4096.0 * sin(dot(c, float3(17.0, 59.4, 15.0)));
    float3 r;
    r.z = frac(512.0 * j);
    j *= .125;
    r.x = frac(512.0 * j);
    j *= .125;
    r.y = frac(512.0 * j);
    return r - 0.5;
}

float _snoise(float3 p) {
    const float F3 = 0.3333333;
    const float G3 = 0.1666667;
    float3 s = floor(p + dot(p, float3(F3, F3, F3)));
    float3 x = p - s + dot(s, float3(G3, G3, G3));
    float3 e = step(float3(0.0, 0.0, 0.0), x - x.yzx);
    float3 i1 = e * (1.0 - e.zxy);
    float3 i2 = 1.0 - e.zxy * (1.0 - e);
    float3 x1 = x - i1 + G3;
    float3 x2 = x - i2 + 2.0 * G3;
    float3 x3 = x - 1.0 + 3.0 * G3;
    float4 w, d;
    w.x = dot(x, x);
    w.y = dot(x1, x1);
    w.z = dot(x2, x2);
    w.w = dot(x3, x3);
    w = max(0.6 - w, 0.0);
    d.x = dot(rand3(s), x);
    d.y = dot(rand3(s + i1), x1);
    d.z = dot(rand3(s + i2), x2);
    d.w = dot(rand3(s + 1.0), x3);
    w *= w;
    w *= w;
    d *= w;
    return dot(d, float4(52.0, 52.0, 52.0, 52.0));
}

[numthreads(10, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    Particle particle = particles[id.x];
    float modifyTime = time / 5.0;
    float positionY = _snoise(float3(particle.position.x / 20.0 + modifyTime, 0.0, particle.position.z / 20.0 + modifyTime)) * 5.0;
    particle.position = float3(particle.position.x, positionY, particle.position.z);
    particles[id.x] = particle;
}
What am I doing wrong, and why is there no increase in computation speed? :)
Thanks in advance!
2 Answers
TL;DR: your GPGPU (compute shader) scenario is unoptimized, thus skewing your results. Consider binding a material to the ComputeBuffer and rendering via Graphics.DrawProcedural. That way everything stays on the GPU.
OP:
Essentially, there are two parts to your problem.
(1) Reading from the GPU is slow
With most things GPU-related, you generally want to avoid reading from the GPU since it will block the CPU. This is true also for GPGPU scenarios.
If I were to hazard a guess, it would be the GPGPU (compute shader) call ComputeBuffer.GetData() shown in your code.
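If a readback is genuinely needed, Unity (2018.1+) offers AsyncGPUReadback, which queues the GPU-to-CPU copy and delivers it via callback instead of stalling the main thread. A minimal sketch, assuming the buffer is created elsewhere; the class, field, and callback names here are illustrative:

```csharp
using UnityEngine;
using UnityEngine.Rendering;

public class AsyncReadbackExample : MonoBehaviour
{
    private ComputeBuffer particleBuffer; // assumed to be created and filled elsewhere

    private void RequestPositions()
    {
        // Queues a GPU->CPU copy; the callback fires a few frames later
        // rather than blocking the main thread the way GetData() does.
        AsyncGPUReadback.Request(particleBuffer, request =>
        {
            if (request.hasError) return;
            var data = request.GetData<Vector3>(); // NativeArray view of the result
            // ... consume the positions here (they are a few frames old) ...
        });
    }
}
```

The trade-off is latency: the data you receive is a few frames stale, which is usually acceptable for visual effects.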
(2) Explicit GPU reading is not required in your scenario
I can see you are creating 2,500 "particles", each attached to a GameObject. If the intent is just to draw a simple quad, then it's more efficient to create an array of structs containing a Vector3 position and then perform a single batch render call to draw all the particles in one go. Proof: see the video below of an n-body simulation running at 60+ FPS on a 2014-era NVidia card.
e.g. for my GPGPU n-Body Galaxy Simulation I do just that. Pay attention to the StarMaterial.SetBuffer("stars", _starsBuffer) call during actual rendering. That tells the GPU to use the buffer that already exists on the GPU, the very same buffer that the compute shader used to move the star positions. There is no CPU reading from the GPU here.
n-Body galaxy simulation of 10,000 stars:

I think everyone can agree that Microsoft's GPGPU documentation is pretty sparse so your best bet is to check out examples scattered around the interwebs. One that comes to mind is the excellent "GPU Ray Tracing in Unity" series over at Three Eyed Games. See the link below.
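The stay-on-the-GPU rendering described above can be sketched roughly as follows. This is a minimal sketch, not the answerer's actual code: the material, buffer, and shader property names are illustrative, and the material's shader is assumed to fetch positions from the buffer itself (e.g. via SV_VertexID). Note that in recent Unity versions this two-argument overload has been renamed Graphics.DrawProceduralNow:

```csharp
using UnityEngine;

public class ProceduralParticleRenderer : MonoBehaviour
{
    [SerializeField] private Material particleMaterial; // its shader reads the StructuredBuffer
    private ComputeBuffer particleBuffer;               // the same buffer the compute kernel writes
    private const int ParticleCount = 2500;

    private void OnRenderObject()
    {
        // Bind the GPU-resident buffer to the material; no CPU readback occurs.
        particleMaterial.SetBuffer("particles", particleBuffer);
        particleMaterial.SetPass(0);
        // Draw one point per particle; the vertex shader fetches each
        // position from the buffer by vertex index.
        Graphics.DrawProcedural(MeshTopology.Points, ParticleCount);
    }
}
```

Because the dispatch, the buffer, and the draw all live on the GPU, the per-frame GetData() and transform loop disappear entirely.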
See also:
ComputeBuffer.GetData takes a long time: the CPU copies the data back from the GPU, which stalls the main thread.
Then you loop over all the transforms to change their positions; this is certainly faster than thousands of MonoBehaviours, but still slow.
There are two ways to optimize your code.
CPU
C# Job System + Burst
Detailed tutorial: https://github.com/stella3d/job-system-cookbook
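The Job System + Burst approach can be sketched like this. This is a hedged sketch, not code from the linked tutorial: it assumes the Burst, Collections, and Mathematics packages are installed, and the job and field names are illustrative:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// A Burst-compiled job that computes the new Y height of every particle in parallel.
[BurstCompile]
public struct NoiseHeightJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float3> positions;
    public NativeArray<float> heights;
    public float time;

    public void Execute(int i)
    {
        float3 p = positions[i];
        // Unity.Mathematics ships a Burst-friendly noise implementation,
        // mirroring the Mathf.PerlinNoise call from the question.
        heights[i] = noise.cnoise(new float2(p.x / 20f + time, p.z / 20f + time)) * 5f;
    }
}
```

Schedule it with something like `new NoiseHeightJob { ... }.Schedule(count, 64)` and call Complete() before applying the results; to avoid touching Transforms on the main thread at all, the same logic can be moved into an IJobParallelForTransform over a TransformAccessArray.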
GPU
Use the structured buffer calculated in the compute shader without copying it back to the CPU. Here is a detailed tutorial on how to do it:
https://catlikecoding.com/unity/tutorials/basics/compute-shaders/
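On the GPU route, the key idea is that the material's shader reads the same StructuredBuffer the compute kernel writes, so nothing ever crosses back to the CPU. A rough HLSL sketch of such a vertex/fragment pair; the struct mirrors the question's kernel, the rest is illustrative and assumes UnityCG.cginc is included:

```hlsl
#include "UnityCG.cginc"

struct Particle {
    float3 position;
    float4 color;
};

// The very same buffer the compute kernel writes, bound via material.SetBuffer.
StructuredBuffer<Particle> particles;

struct v2f {
    float4 vertex : SV_POSITION;
    float4 color  : COLOR0;
};

// One point per particle: fetch its data by vertex index.
v2f vert(uint id : SV_VertexID)
{
    v2f o;
    Particle p = particles[id];
    o.vertex = UnityObjectToClipPos(float4(p.position, 1.0));
    o.color = p.color;
    return o;
}

fixed4 frag(v2f i) : SV_Target
{
    return i.color;
}
```

Paired with a procedural draw call for 2,500 points, this replaces both the GetData() readback and the per-object Transform updates.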