
Does unrolling a loop affect the accuracy of the computations performed within the loop?

Posted on 2025-01-20 08:24:15


Summarized question: Does unrolling a loop affect the accuracy of the computations performed within the loop? And if so, why?

Elaboration and background: I am writing a compute shader in HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, meaning that high computational accuracy is essential.

When comparing my results with an equivalent procedure in Python, I noticed deviations on the order of 1e-5. This was concerning, as I did not expect such large errors to result from precision differences, e.g., the float precision of trigonometric or power functions in HLSL.

Ultimately, after much debugging, I now believe the choice of unrolling or not unrolling a loop to be the cause of the deviation. However, I find this strange, as I cannot find any source indicating that unrolling a loop affects accuracy in addition to the "space–time tradeoff".

For clarification, if I take my Python results as the correct solution, unrolling the loop in HLSL gives me better results than not unrolling it.

Minimal working example: Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive the somewhat messy implementation of Newton's method, but I chose to keep it since I believe it might be a cause of this deviation. That is, if simply computing cos(x), there is no difference between the unrolled and not-unrolled results. Nonetheless, I still fail to understand how simply adding [unroll(N)] in the testing kernel changes the result...

// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{

    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {

        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));

        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);

        CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];

        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] + 
                "  -  Not Unrolled ans = " + ansNotUnrolled[i] +  
                "  --  Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }
        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}

#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newton's method
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of Dummy function above using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve.
double f(double fnh, double h, double theta)
{
    // Solved using Newton's method.
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Define initial guess for k, hereby denoted as x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);
        
        if (abs(fxn) < epsilon)     // A solution is found.
            return xn;
        
        dfxn = dfDummy(xn, fnh, h, theta);

        if (dfxn == 0.0)    // No solution found.
            return xn;

        xn = xn - fxn / dfxn;
    }

    // No solution found.
    return xn;
}

[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;
    
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01;   // Example values.
   
    for (int i = 0; i < N; i++)                 // Not being unrolled
    {   
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }
    
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01;          // Example values.

    [unroll(N)] for (int j = 0; j < N; j++)     // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}

Image of Unity console when running the above
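
For comparison, a double-precision Python port of the shader's Newton iteration might look like the sketch below. This is not the asker's actual reference script, which is not shown in the post; the names `f_dummy`, `df_dummy` and `solve` are hypothetical ports of `fDummy`, `dfDummy` and `f`, and the parameter values are the example values from the kernel.

```python
import math


def f_dummy(k, fnh, h, theta):
    """Residual whose root in k is sought (port of fDummy)."""
    return fnh * fnh * k * h * math.cos(theta) * math.cos(theta) - math.tanh(k * h)


def df_dummy(k, fnh, h, theta, step=1e-3):
    """Central finite-difference derivative of f_dummy (port of dfDummy)."""
    return (f_dummy(k + step, fnh, h, theta) - f_dummy(k - step, fnh, h, theta)) / (2.0 * step)


def solve(fnh, h, theta, x0=10.0, max_iter=50, epsilon=1e-8):
    """Newton's method with the same stopping rules as the HLSL function f."""
    xn = x0
    for _ in range(max_iter):
        fxn = f_dummy(xn, fnh, h, theta)
        if abs(fxn) < epsilon:   # A solution is found.
            return xn
        dfxn = df_dummy(xn, fnh, h, theta)
        if dfxn == 0.0:          # Flat derivative: give up.
            return xn
        xn -= fxn / dfxn
    return xn                    # No solution found within max_iter.


# Same three theta values as the kernel, accumulated the same way.
thetas, roots = [], []
theta = -0.161
for _ in range(3):
    thetas.append(theta)
    roots.append(solve(0.9, 4.53052, theta))
    theta += 0.01

for t, r in zip(thetas, roots):
    print(f"theta = {t:+.3f}  ->  k = {r!r}")
```

Everything here runs in IEEE-754 double precision on the CPU, so it only serves as a baseline to compare the two GPU buffers against.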

Edit: After some more testing, the deviation has been narrowed down to the following code, which gives a difference of about 1e-17 between the exact same code unrolled and not unrolled. Despite the small difference, I still consider it a valid example of the issue, as I believe the results should be equal.

[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;
    
    int N = 3;
    double k = 1.0;
    
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
 
    for (int i = 0; i < N; i++)                 // Not being unrolled
    {
        _CBNotUnrolled[i] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
   
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
 
    [unroll(N)]
    for (int j = 0; j < N; j++)     // Being unrolled.
    {
        _CBUnrolled[j] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
}

Image of Unity console when running the edited script above

Edit 2: The following is the compiled code for the kernel given in the first edit. Unfortunately, my experience with assembly language is limited, and I am not able to tell whether this listing shows any errors or whether it is useful for the problem at hand.

**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 648:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
//       Double-precision floating point
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
      cs_5_0
      dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps
      dcl_uav_structured u0, 8
      dcl_uav_structured u1, 8
      dcl_input vThreadID.x
      dcl_temps 2
      dcl_thread_group 64, 1, 1
   0: ine r0.x, vThreadID.x, l(1)
   1: if_nz r0.x
   2:   ret 
   3: endif 
   4: dmov r0.xy, d(-0.161000l, 0.000000l)
   5: mov r0.z, l(0)
   6: loop 
   7:   ige r0.w, r0.z, l(3)
   8:   breakc_nz r0.w
   9:   dmul r1.xyzw, r0.xyxy, d(1.001000l, 0.999000l)
  10:   dadd r1.xy, -r1.zwzw, r1.xyxy
  11:   store_structured u1.xy, r0.z, l(0), r1.xyxx
  12:   dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
  13:   iadd r0.z, r0.z, l(1)
  14: endloop 
  15: store_structured u0.xy, l(0), l(0), l(-0.000000,-0.707432,0,0)
  16: store_structured u0.xy, l(1), l(0), l(0.000000,-0.702312,0,0)
  17: store_structured u0.xy, l(2), l(0), l(-918250586112.000000,-0.697192,0,0)
  18: ret 
// Approximately 0 instruction slots used
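
A note on reading this listing: instructions 6 to 14 are a runtime loop using `dmul`/`dadd` that stores to `u1`, while instructions 15 to 17 store literal constants to `u0`, which suggests the unrolled loop was evaluated at compile time. The odd-looking immediates such as `l(-0.000000,-0.707432,0,0)` appear to be a 64-bit double printed as two 32-bit words, each interpreted as a float. The Python sketch below checks that reading under the assumption that the words are stored low word first; `split_double` is a hypothetical helper, not part of any shader tooling.

```python
import struct


def split_double(value):
    """Return the (low, high) 32-bit halves of a double, each viewed as a float32."""
    raw = struct.pack("<d", value)          # 8 bytes, little-endian
    low, high = struct.unpack("<ff", raw)   # the same bytes viewed as two float32 values
    return low, high


k = 1.0
theta = -0.161
high_words = []
for j in range(3):
    value = (k + 1e-3) * theta - (k - 1e-3) * theta
    low, high = split_double(value)
    high_words.append(high)
    print(f"j = {j}: double = {value!r}, high word as float = {high:.6f}")
    theta += 0.01
```

The high words come out close to the second component of each immediate in instructions 15 to 17, which is consistent with those stores holding precomputed doubles near 2e-3 * theta. The first component is the low word, so it looks like noise when printed as a float.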

Edit 3: After reaching out to Microsoft (see https://learn.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem is more about Unity. This is because

"The pragma unroll [(n)] is keil compiler which Unity uses topic"


锦上情书 2025-01-27 08:24:15

This is driver-, hardware-, compiler-, and Unity-dependent.

In essence, the HLSL specification has somewhat looser guarantees for rounding behavior of mathematical operations than regular IEEE-754 floating point.

First, it is implementation-dependent whether operations round up or down.

IEEE-754 requires floating-point operations to produce a result that
is the nearest representable value to an infinitely-precise result,
known as round-to-nearest-even. Direct3D 10, however, defines a looser
requirement: 32-bit floating-point operations produce a result that is
within one unit-last-place (1 ULP) of the infinitely-precise result.
This means that, for example, hardware is allowed to truncate results
to 32-bit rather than perform round-to-nearest-even, as that would
result in error of at most one ULP.

See https://learn.microsoft.com/en-us/windows/win32/direct3d10/d3d10-graphics-programming-guide-resources-float-rules#32-bit-floating-point-rules
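
To make the 1 ULP allowance concrete, here is a small Python sketch that converts a double to 32-bit float in two ways: round-to-nearest-even (what `struct` does by default) and truncation toward zero. The helper names are hypothetical, and the truncation trick assumes the value lies in the normal float32 range.

```python
import struct


def to_f32_nearest(x):
    """Round a double to float32 using round-to-nearest-even."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f32_truncate(x):
    """Truncate a double toward zero to float32 precision.

    A double has 52 mantissa bits and a float32 has 23, so clearing the
    low 29 bits leaves a value that float32 can represent exactly.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits &= ~((1 << 29) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


x = 0.1
nearest = to_f32_nearest(x)
truncated = to_f32_truncate(x)
print(f"nearest   = {nearest!r}")
print(f"truncated = {truncated!r}")
print(f"gap       = {nearest - truncated!r}  (1 ULP at this magnitude is {2.0 ** -27!r})")
```

Both results are within 1 ULP of 0.1, so both would satisfy the Direct3D 10 rule quoted above, yet they are different numbers.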

Going one step further, the HLSL compiler itself has many fast-math optimizations that can violate IEEE-754 float conformance; see, for example:

D3DCOMPILE_IEEE_STRICTNESS - Forces strict compile, which might not allow for legacy syntax. By default, the compiler disables strictness on deprecated syntax.
D3DCOMPILE_OPTIMIZATION_LEVEL3 - Directs the compiler to use the highest optimization level. If you set this constant, the compiler produces the best possible code but might take significantly longer to do so. Set this constant for final builds of an application when performance is the most important factor.
D3DCOMPILE_PARTIAL_PRECISION - Directs the compiler to perform all computations with partial precision. If you set this constant, the compiled code might run faster on some hardware.

Source: https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3dcompile-constants

This particularly matters for your scenario, because if optimizations are enabled, the existence of loop unrolling can trigger constant folding optimizations that reduce the computational cost of your code and change the precision of its results (potentially even improving them). Note that when constant folding occurs, the compiler has to decide how to perform rounding, and that might disagree with what your hardware FPUs would do.
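
As a rough illustration of the scale involved, the Python sketch below evaluates the expression from the question's reduced kernel in two algebraically equivalent ways. It is only a sketch: `simplified` is a hypothetical rewrite a compiler could choose, and neither form is claimed to be what fxc or the GPU actually computes.

```python
def direct(k, theta):
    """The expression exactly as written in the reduced kernel."""
    return (k + 1e-3) * theta - (k - 1e-3) * theta


def simplified(theta):
    """An algebraically equivalent form: the k * theta terms cancel."""
    return 2e-3 * theta


theta = -0.161
diffs = []
for i in range(3):
    a = direct(1.0, theta)
    b = simplified(theta)
    diffs.append(a - b)
    print(f"i = {i}: direct = {a!r}, simplified = {b!r}, diff = {a - b!r}")
    theta += 0.01
```

The two products in `direct` are each about 0.16 in magnitude, where one double ULP is roughly 2.8e-17. Their rounding errors survive the subtraction, so any discrepancy between the two forms lands around the 1e-17 level reported in the question.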

Oh, and note that IEEE-754 does not place constraints on the precision, let alone require implementation, of "additional operations" (e.g. sin, cos, tanh, atan, ln, etc); it purely recommends them.

For a well-known case where this goes wrong, see "sin(x) only returns 4 different values for moderately large input on GLSL fragment shader, Intel HD4000": there, sin gets quantized to 4 different values on Intel integrated graphics but has reasonable precision on other hardware.

Also, note that Unity does not guarantee that a float in shader is actually a 32-bit float; on certain hardware (e.g. mobile), it can even be backed by a 16-bit half or an 11-bit fixed.

High precision: float
Highest precision floating point value; generally 32 bits (just like float from regular programming languages).

...
One complication of float/half/fixed data type usage is that PC GPUs are always high precision. That is, for all the PC (Windows/Mac/Linux) GPUs, it does not matter whether you write float, half or fixed data types in your shaders. They always compute everything in full 32-bit floating point precision.

The half and fixed types only become relevant when targeting mobile
GPUs, where these types primarily exist for power (and sometimes
performance) constraints. Keep in mind that you need to test your
shaders on mobile to see whether or not you are running into
precision/numerical issues.

Even on mobile GPUs, the different precision support varies between
GPU families.

Source: https://docs.unity3d.com/Manual/SL-DataTypesAndPrecision.html
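
To get a feel for how much the backing type matters, the sketch below round-trips the question's starting angle through IEEE half, single, and double precision with Python's `struct` module. It does not model Unity's `fixed` type, and `round_trip` is a hypothetical helper.

```python
import struct


def round_trip(value, fmt):
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


theta = -0.161
errors = {}
for name, fmt in (("half", "<e"), ("float", "<f"), ("double", "<d")):
    stored = round_trip(theta, fmt)
    errors[name] = abs(stored - theta)
    print(f"{name:>6}: stored = {stored!r}, abs error = {errors[name]:.3e}")
```

The half-precision error is several orders of magnitude larger than the single-precision one, which is why the same shader can behave very differently on hardware that backs `float` or `half` with a narrower type.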

I don't believe Unity exposes compiler flags to developers; you are at its whim as to what optimizations it passes to dxc/fxc. Given it's primarily used for games, you can bet they enable optimizations.

Source: https://forum.unity.com/threads/possible-to-set-directx-compiler-flags-in-shaders.453790/

Finally, check out "Floating-Point Determinism" by Bruce Dawson if you want an in-depth dive into this topic. I will add that this problem also exists if you want consistent results between languages (since a language can implement math functions itself rather than using hardware intrinsics, e.g. for better precision), when cross-compiling (since different compilers and backends can optimize differently or use different system libraries), or when running managed code across different runtimes (e.g. since a JIT can apply different optimizations).
