Calculating CPU frequency in C with RDTSC always returns 0

Posted on 2024-09-01 12:42:02

Our instructor gave us the following piece of code so we could measure the performance of some algorithms:

#include <stdio.h>
#include <unistd.h>

static unsigned cyc_hi = 0, cyc_lo = 0;

static void access_counter(unsigned *hi, unsigned *lo) {
    asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
    : "=r" (*hi), "=r" (*lo)
    : /* No input */
    : "%edx", "%eax");
}

void start_counter() {
    access_counter(&cyc_hi, &cyc_lo);
}

double get_counter() {
    unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
    double result;

    access_counter(&ncyc_hi, &ncyc_lo);

    lo = ncyc_lo - cyc_lo;
    borrow = lo > ncyc_lo;
    hi = ncyc_hi - cyc_hi - borrow;

    result = (double) hi * (1 << 30) * 4 + lo;

    return result;
}

However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:

int main(void)
{
    double c1, c2;

    start_counter();

    c1 = get_counter();
    sleep(1);
    c2 = get_counter();

    printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
    printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);

    return 0;
}

The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as a guest on VMware.

On a friend's machine (MacBook) it works to some extent: the result is greater than 0, but it varies because the CPU frequency is not fixed (we tried to fix it but for some reason could not). He has a different machine that runs Linux (Ubuntu) as the host, and it also reports 0. This rules out the virtual machine as the cause, which is what I suspected at first.

Any ideas why this is happening and how I can fix it?

Comments (5)

人生戏 2024-09-08 12:42:02

Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:

rdtsc
push 1
call sleep
rdtsc

Modern CPUs do not necessarily execute instructions in their original order, though. Regardless of the order you wrote, the CPU is (mostly) free to execute that as if it were:

rdtsc
rdtsc
push 1
call sleep

In this case, it's clear why the difference between the two rdtsc readings would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never reorder. The most common instruction to use for that is CPUID, which is serializing. The other answer I linked should (if memory serves) start roughly from there and cover the steps necessary to use CPUID correctly and effectively for this task.
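As a rough illustration of the CPUID approach, here is a minimal sketch for GCC or Clang on x86. The helper name serialized_rdtsc is made up for this example, and the sketch only shows the ordering idea; it is not a tuned benchmarking harness.

```c
#include <stdint.h>

/* Hypothetical helper: execute CPUID (a serializing instruction) and
   then read the time-stamp counter. */
static uint64_t serialized_rdtsc(void) {
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /* CPUID leaf 0: the results are ignored, only the serializing effect
       matters. volatile keeps the compiler from deleting the statement
       even though its outputs are unused. */
    __asm__ __volatile__ ("cpuid"
        : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
        :
        : "memory");

    /* volatile asm statements are not reordered relative to each other,
       so rdtsc is emitted after cpuid. */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));

    return ((uint64_t)hi << 32) | lo;
}
```

Calling serialized_rdtsc() before and after the code under test and subtracting the two values gives the elapsed tick count. CPUID itself costs a noticeable number of cycles, so very short intervals will be dominated by that overhead.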

Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.

Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).

What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).

半枫 2024-09-08 12:42:02

I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit carry arithmetic yourself, and you don't need to store the result as a double. You also don't need separate output operands with extra moves in your inline asm; you can tell GCC directly that the results are in eax and edx.

Here is a greatly simplified version of this code:

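#include <stdint.h>

uint64_t rdtsc(void) {
    uint64_t ret;

#if defined(__x86_64__)
    uint64_t hi;  /* scratch output: rdtsc overwrites RDX */

    /* On x86-64, "=A" means "RAX or RDX", not the pair, so both
       registers are named explicitly. volatile forces a fresh read. */
    asm volatile ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax"
        : "=a"(ret), "=d"(hi)
    );
#else
    /* On 32-bit x86, "=A" is the EDX:EAX pair that rdtsc fills. */
    asm volatile ("rdtsc"
        : "=A"(ret)
    );
#endif
    return ret;
}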

You should also consider printing the raw values you get from this, so you can see whether the readings themselves are 0 or whether something else is going wrong.
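A small sketch of that debugging step, assuming GCC or Clang on x86. The names read_tsc and report_elapsed are invented for this example; the point is that with 64-bit unsigned values the difference is a single subtraction, and printing both readings shows at a glance whether they are identical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read the 64-bit time-stamp counter (low half in EAX, high half in EDX). */
static uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: print both raw readings and return the difference. */
static uint64_t report_elapsed(uint64_t start, uint64_t end) {
    uint64_t elapsed = end - start;  /* unsigned 64-bit math, no manual borrow */
    printf("start=%" PRIu64 " end=%" PRIu64 " elapsed=%" PRIu64 "\n",
           start, end, elapsed);
    return elapsed;
}
```

If start and end print as the same number, the two reads were merged or the counter is not advancing; if they differ but the final result is still 0, the problem is in the later arithmetic or formatting.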

伴梦长久 2024-09-08 12:42:02

As for VMware, take a look at the timekeeping spec (PDF link), as well as this thread. Depending on the guest OS, TSC instructions are either:

  • Passed directly to the real hardware (PV guest)
  • Made to count cycles only while the VM is executing on the host processor (Windows, etc.)

Note the qualifier in the second case: while the VM is executing on the host processor. If I recall correctly, the same goes for Xen. In essence, you can expect the code to behave as intended on a paravirtualized guest. If the TSC is emulated, it's entirely unreasonable to expect hardware-like consistency.
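This is not something the VMware documents above describe, but as a quick self-check a program can ask CPUID whether it is running under a hypervisor and whether the CPU advertises an invariant TSC. The sketch below assumes GCC or Clang on x86 and uses __get_cpuid from <cpuid.h>; the two function names are made up for the example.

```c
#include <cpuid.h>

/* Returns 1 if CPUID leaf 1 sets the "hypervisor present" bit
   (ECX bit 31), 0 otherwise. */
static int running_under_hypervisor(void) {
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;  /* leaf not supported */
    return (int)((ecx >> 31) & 1u);
}

/* Returns 1 if CPUID leaf 0x80000007 sets the "invariant TSC" bit
   (EDX bit 8), 0 otherwise. */
static int has_invariant_tsc(void) {
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return 0;  /* extended leaf not supported */
    return (int)((edx >> 8) & 1u);
}
```

A hypervisor decides what the guest sees in these bits, so treat the answers as hints about the environment rather than guarantees about TSC behavior.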

很快妥协 2024-09-08 12:42:02

You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)

This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
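For reference, the smallest change to the question's code is to add volatile to the existing asm statement. The sketch below keeps the question's access_counter body otherwise unchanged (GCC or Clang on x86 assumed); read_counter is a hypothetical helper added only to combine the two halves.

```c
#include <stdint.h>

/* Same asm as in the question, with volatile added so the compiler
   treats every call as a fresh read instead of a reusable value. */
static void access_counter(unsigned *hi, unsigned *lo) {
    __asm__ __volatile__ ("rdtsc; movl %%edx,%0; movl %%eax,%1"
        : "=r" (*hi), "=r" (*lo)
        : /* No input */
        : "%edx", "%eax");
}

/* Hypothetical helper: combine the halves with integer arithmetic. */
static uint64_t read_counter(void) {
    unsigned hi, lo;

    access_counter(&hi, &lo);
    return ((uint64_t)hi << 32) | lo;
}
```

With this change, two consecutive calls return different values even at high optimization levels, because the compiler may no longer fold them into one.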

See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and @Mysticial's answer there has working GNU C inline asm, which I'll quote here:

// Prefer the __rdtsc() intrinsic over inline asm where it is available.
#include <stdint.h>

uint64_t rdtsc(void){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

This works correctly and efficiently for 32 and 64-bit code.
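To tie this back to the question's goal, here is a hedged sketch that uses the __rdtsc() intrinsic together with clock_gettime to estimate how many TSC ticks elapse per second. It assumes GCC or Clang on x86 and a POSIX system; estimate_tsc_hz is a made-up name. On CPUs with an invariant TSC the result is the TSC rate, which need not equal the current core clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>  /* __rdtsc() */

/* Hypothetical helper: estimate TSC ticks per second by sleeping for
   roughly `ms` milliseconds and timing the sleep with CLOCK_MONOTONIC. */
static double estimate_tsc_hz(long ms) {
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    struct timespec t0, t1;
    uint64_t c0, c1;
    double secs;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    c0 = __rdtsc();
    nanosleep(&req, NULL);
    c1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Divide by the measured time, not the requested time: sleeps can
       overrun or be interrupted early. */
    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)(c1 - c0) / secs;
}
```

For example, printf("%.1f MHz\n", estimate_tsc_hz(1000) / 1e6) would print the estimate in MHz after a one-second measurement.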

无悔心 2024-09-08 12:42:02

Hmm, I'm not positive, but I suspect the problem may be in this line:

result = (double) hi * (1 << 30) * 4 + lo;

I'm not sure you can safely carry out such huge multiplications in an "unsigned". Isn't that often a 32-bit number? The fact that you couldn't safely multiply by 2^32 and had to tack on an extra "* 4" after the 2^30 already hints at this possibility. You might need to convert each sub-component, hi and lo, to a double (instead of a single cast at the very end) and do the multiplication with the two doubles.
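One way to test this suspicion is to compare the expression against the same value computed with 64-bit integers. In the sketch below (assuming a 32-bit unsigned; both function names are made up), the cast applies to hi before the first multiplication, so (1 << 30) and 4 are converted to double and the products are computed in floating point rather than in 32-bit integer arithmetic.

```c
#include <stdint.h>

/* The expression from the question: (double) binds to hi first, so
   every later operand is converted to double before it is used. */
static double combine_as_double(unsigned hi, unsigned lo) {
    return (double) hi * (1 << 30) * 4 + lo;
}

/* Reference value computed in 64-bit integer arithmetic. */
static uint64_t combine_as_u64(unsigned hi, unsigned lo) {
    return ((uint64_t)hi << 32) | lo;
}
```

For inputs whose combined value stays below 2^53, the two functions agree exactly, which suggests this line is not where the zero comes from.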
