64 位模运算的奇怪性能行为

发布于 2024-08-18 06:33:27 字数 2057 浏览 5 评论 0原文

最后三个方法调用大约需要花费时间。比前四个时间多了一倍。

唯一的区别是它们的参数不再适合整数。但这有关系吗?该参数被声明为long,因此无论如何都应该使用long进行计算。对于数字>maxint,模运算是否使用另一种算法?

我使用的是 amd athlon64 3200+、winxp sp3 和 vs2008。

       Stopwatch sw = new Stopwatch();
       TestLong(sw, int.MaxValue - 3l);
       TestLong(sw, int.MaxValue - 2l);
       TestLong(sw, int.MaxValue - 1l);
       TestLong(sw, int.MaxValue);
       TestLong(sw, int.MaxValue + 1l);
       TestLong(sw, int.MaxValue + 2l);
       TestLong(sw, int.MaxValue + 3l);
       Console.ReadLine();

    static void TestLong(Stopwatch sw, long num)
    {
        long n = 0;
        sw.Reset();
        sw.Start();
        for (long i = 3; i < 20000000; i++)
        {
            n += num % i;
        }
        sw.Stop();
        Console.WriteLine(sw.Elapsed);            
    }

编辑: 我现在尝试使用 C 进行相同的操作,但问题不会出现在这里,所有模运算在发布和调试模式下都需要相同的时间,无论是否打开优化:

#include "stdafx.h"
#include "time.h"
#include "limits.h"

static void TestLong(long long num)
{
    long long n = 0;

    clock_t t = clock();
    for (long long i = 3; i < 20000000LL*100; i++)
    {
        n += num % i;
    }

    printf("%d - %lld\n", clock()-t, n);  
}

int main()
{
    printf("%i %i %i %i\n\n", sizeof (int), sizeof(long), sizeof(long long), sizeof(void*));

    TestLong(3);
    TestLong(10);
    TestLong(131);
    TestLong(INT_MAX - 1L);
    TestLong(UINT_MAX +1LL);
    TestLong(INT_MAX + 1LL);
    TestLong(LLONG_MAX-1LL);

    getchar();
    return 0;
}

编辑2:

感谢您的宝贵建议。我发现 .net 和 c(在调试和发布模式下)都不会使用原子 cpu 指令来计算余数,但它们会调用一个函数来计算余数。

在c程序中我可以得到它的名称“_allrem”。它还显示了该文件的完整源注释,因此我发现该算法特例使用 32 位除数而不是 .net 应用程序中的除数。

我还发现,c 程序的性能实际上只受除数的值影响,而不受除数的影响。另一项测试表明,.net 程序中余数函数的性能取决于被除数和除数。

顺便说一句:即使是 long long 值的简单加法也是通过连续的 add 和 adc 指令来计算的。因此,即使我的处理器称自己为 64 位,它实际上也不是 :(

EDIT3:

我现在在 Windows 7 x64 版本上运行 c 应用程序,使用 Visual Studio 2010 编译。有趣的是,尽管现在(我检查了汇编源代码)使用了真正的 64 位指令,但性能行为保持不变。

The last three of these method calls take approx. double the time than the first four.

The only difference is that their arguments doesn't fit in integer anymore. But should this matter? The parameter is declared to be long, so it should use long for calculation anyway. Does the modulo operation use another algorithm for numbers>maxint?

I am using amd athlon64 3200+, winxp sp3 and vs2008.

       Stopwatch sw = new Stopwatch();
       TestLong(sw, int.MaxValue - 3l);
       TestLong(sw, int.MaxValue - 2l);
       TestLong(sw, int.MaxValue - 1l);
       TestLong(sw, int.MaxValue);
       TestLong(sw, int.MaxValue + 1l);
       TestLong(sw, int.MaxValue + 2l);
       TestLong(sw, int.MaxValue + 3l);
       Console.ReadLine();

    static void TestLong(Stopwatch sw, long num)
    {
        long n = 0;
        sw.Reset();
        sw.Start();
        for (long i = 3; i < 20000000; i++)
        {
            n += num % i;
        }
        sw.Stop();
        Console.WriteLine(sw.Elapsed);            
    }

EDIT:
I now tried the same with C and the issue does not occur here, all modulo operations take the same time, in release and in debug mode with and without optimizations turned on:

#include "stdafx.h"
#include "time.h"
#include "limits.h"

static void TestLong(long long num)
{
    long long n = 0;

    clock_t t = clock();
    for (long long i = 3; i < 20000000LL*100; i++)
    {
        n += num % i;
    }

    printf("%d - %lld\n", clock()-t, n);  
}

int main()
{
    printf("%i %i %i %i\n\n", sizeof (int), sizeof(long), sizeof(long long), sizeof(void*));

    TestLong(3);
    TestLong(10);
    TestLong(131);
    TestLong(INT_MAX - 1L);
    TestLong(UINT_MAX +1LL);
    TestLong(INT_MAX + 1LL);
    TestLong(LLONG_MAX-1LL);

    getchar();
    return 0;
}

EDIT2:

Thanks for the great suggestions. I found that both .net and c (in debug as well as in release mode) does't not use atomically cpu instructions to calculate the remainder but they call a function that does.

In the c program I could get the name of it which is "_allrem". It also displayed full source comments for this file so I found the information that this algorithm special cases the 32bit divisors instead of dividends which was the case in the .net application.

I also found out that the performance of the c program really is only affected by the value of the divisor but not the dividend. Another test showed that the performance of the remainder function in the .net program depends on both the dividend and divisor.

BTW: Even simple additions of long long values are calculated by a consecutive add and adc instructions. So even if my processor calls itself 64bit, it really isn't :(

EDIT3:

I now ran the c app on a windows 7 x64 edition, compiled with visual studio 2010. The funny thing is, the performance behavior stays the same, although now (I checked the assembly source) true 64 bit instructions are used.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

倒数 2024-08-25 06:33:27

多么奇怪的观察啊。您可以采取以下措施来进一步研究这一点:在程序的开头添加一个“暂停”,例如 Console.ReadLine,但在第一次调用您的方法之后。然后以“发布”模式构建程序。然后启动程序而不是在调试器中。然后,在暂停时附加调试器。调试它并查看为相关方法编写的代码。找到循环体应该很容易。

了解生成的循环体与 C 程序中的循环体有何不同将会很有趣。

跳过所有这些障碍的原因是,抖动会改变在抖动“调试”程序集时生成的代码在抖动已附加调试器的程序时;在这些情况下,它会生成在调试器中更容易理解的代码。看看抖动认为是针对这种情况生成的“最佳”代码会更有趣,因此您必须在抖动运行后稍后附加调试器。

What a curious observation. Here's something you can do to investigate this further: add a "pause" at the beginning of the program, like a Console.ReadLine, but AFTER the first call to your method. Then build the program in "release" mode. Then start the program not in the debugger. Then, at the pause, attach the debugger. Debug through it and take a look at the code jitted for the method in question. It should be pretty easy to find the loop body.

It would be interesting to know how the generated loop body differs from that in your C program.

The reason for all those hoops to jump through is because the jitter changes what code it generates when jitting a "debug" assembly or when jitting a program that already has a debugger attached; it jits code that is easier to understand in a debugger in those cases. It would be more interesting to see what the jitter thinks is the "best" code generated for this case, so you have to attach the debugger late, after the jitter has run.

亣腦蒛氧 2024-08-25 06:33:27

您是否尝试过在您的盒子上以本机代码执行相同的操作?

如果本机 64 位余数运算出现两个参数都在 32 位范围内的特殊情况(基本上将其委托给 32 位运算),我不会感到惊讶。 (或者可能是 JIT 这样做...)优化这种情况确实很有意义,不是吗?

Have you tried performing the same operations in native code on your box?

I wouldn't be surprised if the native 64-bit remainder operation special-cased situations where both arguments are within the 32-bit range, basically delegating that to the 32-bit operation. (Or possibly it's the JIT that does that...) It does make a fair amount of sense to optimise that case, doesn't it?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文