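Does unrolling a loop affect the accuracy of the computations inside it?
Summary of the problem
Does unrolling a loop affect the accuracy of the computations performed within the loop? And if so, why?
Details and background
I am writing a compute shader in HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, which means that high computational accuracy is essential.
When comparing my results with an equivalent procedure in Python, I noticed deviations on the order of 1e-5. This was concerning, as I did not expect errors this large to result from differences in precision, e.g. the floating-point precision of the trigonometric or power functions in HLSL.
Ultimately, after much debugging, I now believe that the choice of unrolling or not unrolling a loop is the cause of the deviation. However, I find this strange, as I cannot find any source indicating that unrolling a loop affects accuracy in addition to the "space-time tradeoff".
For clarification, if my Python results are taken as the correct solution, unrolling the loop in HLSL gives better results than not unrolling it.
Minimal working example
Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when run in Unity (2021.2.9f1). Forgive the somewhat messy implementation of Newton's method, but I chose to keep it since I believe it may be a cause of this deviation. That is, if I simply compute cos(x), there is no difference between the unrolled and not unrolled versions. Nonetheless, I still fail to understand how simply adding [unroll(N)] in the test kernel changes the result...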
// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{
    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {
        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));
        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);
        CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];
        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] +
                      " - Not Unrolled ans = " + ansNotUnrolled[i] +
                      " -- Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }

        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}
// Compute shader (HLSL)
#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newton's method
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of the dummy function above using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve.
double f(double fnh, double h, double theta)
{
    // Solved using Newton's method.
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Define initial guess for k, hereby denoted as x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);
        if (abs(fxn) < epsilon) // A solution is found.
            return xn;

        dfxn = dfDummy(xn, fnh, h, theta);
        if (dfxn == 0.0) // No solution found.
            return xn;

        xn = xn - fxn / dfxn;
    }

    // No solution found.
    return xn;
}

[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;

    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++) // Not being unrolled
    {
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }

    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    [unroll(N)] for (int j = 0; j < N; j++) // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}
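A double-precision reference for the shader above can be sketched in Python as follows. This is an illustrative reconstruction of the same Newton iteration, not necessarily the script used for the comparison described earlier; the names f_dummy, df_dummy, solve and reference_values are made up here. Note that Python's math.cos and math.tanh operate on doubles, whereas the HLSL intrinsics may evaluate at lower precision, so some deviation is to be expected regardless of unrolling.

```python
import math


def f_dummy(k, fnh, h, theta):
    """Same expression as fDummy in the shader, evaluated in double precision."""
    c = math.cos(theta)
    return fnh * fnh * k * h * c * c - math.tanh(k * h)


def df_dummy(k, fnh, h, theta, dk=1e-3):
    """Central finite difference, mirroring dfDummy (step 1e-3)."""
    return (f_dummy(k + dk, fnh, h, theta) - f_dummy(k - dk, fnh, h, theta)) / (2 * dk)


def solve(fnh, h, theta, x0=10.0, max_iter=50, epsilon=1e-8):
    """Newton's method with the same initial guess, tolerance and iteration cap."""
    xn = x0
    for _ in range(max_iter):
        fxn = f_dummy(xn, fnh, h, theta)
        if abs(fxn) < epsilon:   # A solution is found.
            return xn
        dfxn = df_dummy(xn, fnh, h, theta)
        if dfxn == 0.0:          # No solution found.
            return xn
        xn -= fxn / dfxn
    return xn                    # No solution found within max_iter.


def reference_values(n=3, fnh=0.9, h=4.53052, theta=-0.161, dtheta=0.01):
    """Evaluate the solver for the same sequence of theta values as the kernel."""
    values = []
    for _ in range(n):
        values.append(solve(fnh, h, theta))
        theta += dtheta
    return values


if __name__ == "__main__":
    for value in reference_values():
        print(repr(value))
```

Printing with repr shows all significant digits, which makes it easier to compare against the values logged from Unity.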
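Edit
After some more testing, the deviation has been narrowed down to the following code, which gives a difference of about 1e-17 between exactly the same code unrolled and not unrolled. Despite the small difference, I still consider it a valid example of the issue, as I believe the two results should be equal.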
[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;

    int N = 3;
    double k = 1.0;

    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++) // Not being unrolled
    {
        _CBNotUnrolled[i] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }

    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    [unroll(N)]
    for (int j = 0; j < N; j++) // Being unrolled.
    {
        _CBUnrolled[j] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
}
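Edit 2
Below is the compiled code for the kernel given in Edit 1. Unfortunately, my experience with assembly language is limited, and I cannot tell whether this output shows any errors or whether it is useful for the problem at hand.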
**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 648:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
// Double-precision floating point
//
//
// Input signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
cs_5_0
dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps
dcl_uav_structured u0, 8
dcl_uav_structured u1, 8
dcl_input vThreadID.x
dcl_temps 2
dcl_thread_group 64, 1, 1
0: ine r0.x, vThreadID.x, l(1)
1: if_nz r0.x
2: ret
3: endif
4: dmov r0.xy, d(-0.161000l, 0.000000l)
5: mov r0.z, l(0)
6: loop
7: ige r0.w, r0.z, l(3)
8: breakc_nz r0.w
9: dmul r1.xyzw, r0.xyxy, d(1.001000l, 0.999000l)
10: dadd r1.xy, -r1.zwzw, r1.xyxy
11: store_structured u1.xy, r0.z, l(0), r1.xyxx
12: dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
13: iadd r0.z, r0.z, l(1)
14: endloop
15: store_structured u0.xy, l(0), l(0), l(-0.000000,-0.707432,0,0)
16: store_structured u0.xy, l(1), l(0), l(0.000000,-0.702312,0,0)
17: store_structured u0.xy, l(2), l(0), l(-918250586112.000000,-0.697192,0,0)
18: ret
// Approximately 0 instruction slots used
Edit 3
After reaching out to Microsoft (see https://learn.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they said that the problem has more to do with Unity. This is because
"pragma unroll [(n)] is a topic for the keil compiler that Unity uses"
Comments (1)
This is driver-, hardware-, compiler-, and Unity-dependent.
In essence, the HLSL specification has somewhat looser guarantees for rounding behavior of mathematical operations than regular IEEE-754 floating point.
First, it is implementation-dependent whether operations round up or down.
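To put a number on this: for the operands in your reduced kernel, the two candidate results of a single multiplication (rounded down versus rounded up) are one unit in the last place apart, roughly 2.8e-17, which is the same order as the difference you observed. A minimal Python sketch of that, assuming Python 3.9+ for math.nextafter; the helper name rounding_candidates is made up for illustration:

```python
import math
from fractions import Fraction


def rounding_candidates(a, b):
    """Return (round_down, round_up): the doubles bracketing the exact product a*b."""
    exact = Fraction(a) * Fraction(b)   # exact rational product of the two doubles
    nearest = a * b                     # IEEE-754 round-to-nearest result
    if Fraction(nearest) == exact:
        return nearest, nearest         # product happens to be exactly representable
    if Fraction(nearest) < exact:
        return nearest, math.nextafter(nearest, math.inf)
    return math.nextafter(nearest, -math.inf), nearest


theta = -0.161
down, up = rounding_candidates(1.0 + 1e-3, theta)
gap = up - down                         # spacing between the two candidates
print(down, up, gap)
```

An implementation that picks the other candidate in even one of the two multiplications ends up with a result that differs in the last bits, even though both are "correct" under a looser rounding guarantee.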
Going one step further, the HLSL compiler itself has many fast-math optimizations that can violate IEEE-754 float conformance.
This particularly matters for your scenario, because if optimizations are enabled, the existence of loop unrolling can trigger constant folding optimizations that reduce the computational cost of your code and change the precision of its results (potentially even improving them). Note that when constant folding occurs, the compiler has to decide how to perform rounding, and that might disagree with what your hardware FPUs would do.
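The disassembly in your Edit 2 looks consistent with this. The non-unrolled loop is executed on the GPU (the dmul and dadd at instructions 9 and 10), while the unrolled results appear as literals in the three store_structured instructions at 15 to 17, which suggests they were computed by the compiler rather than by the hardware. The literals look odd because the disassembler seems to print each 64-bit double as two 32-bit floats. The following Python sketch (my own reconstruction, not tool output; double_as_float_pair is a made-up helper) reinterprets the expected first result the same way:

```python
import struct


def double_as_float_pair(value):
    """Reinterpret the 8 bytes of a double as two 32-bit floats (low word, high word)."""
    return struct.unpack("<2f", struct.pack("<d", value))


k, theta = 1.0, -0.161
result = (k + 1e-3) * theta - (k - 1e-3) * theta   # roughly -3.22e-4
low, high = double_as_float_pair(result)
print(result, high)
```

The high word comes out near -0.707432, the second component of the literal stored by instruction 15, so that instruction most likely writes a precomputed double close to -3.22e-4.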
Oh, and note that IEEE-754 does not place constraints on the precision, let alone require implementation, of "additional operations" (e.g. sin, cos, tanh, atan, ln, etc); it purely recommends them.
Also, note that Unity does not guarantee that a float in a shader is actually a 32-bit float; on certain hardware (e.g. mobile), it can even be backed by a 16-bit half or an 11-bit fixed.
I don't believe Unity exposes compiler flags to developers; you are at its whim as to what optimizations it passes to dxc/fxc. Given it's primarily used for games, you can bet they enable optimizations.
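As a rough illustration of how much precision the narrower formats give up, the sketch below stores one of your theta values as a 32-bit float and as a 16-bit half using Python's struct module and measures the round-trip error. It only models storage precision, not how any particular GPU computes, and round_trip is a made-up helper name.

```python
import struct


def round_trip(value, fmt):
    """Store value in the given struct format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


theta = -0.161
single_err = abs(round_trip(theta, "<f") - theta)   # 32-bit float
half_err = abs(round_trip(theta, "<e") - theta)     # 16-bit half
print(single_err, half_err)
```

The half-precision error is several orders of magnitude larger than the single-precision one, so a silent change of the backing type would dwarf the differences discussed here.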
Finally, check out "Floating-Point Determinism" by Bruce Dawson if you want an in-depth dive into this topic. I will add that this problem also exists if you want consistent results between languages (since a language can implement math functions itself rather than using hardware intrinsics, e.g. for better precision), when cross-compiling (since different compilers/backends can optimize differently or use different system libraries), or when running managed code across different runtimes (e.g. since a JIT can do different optimizations).