Calculating CPU frequency in C with RDTSC always returns 0
The following piece of code was given to us by our instructor so we could measure the performance of some algorithms:
    #include <stdio.h>
    #include <unistd.h>

    static unsigned cyc_hi = 0, cyc_lo = 0;

    static void access_counter(unsigned *hi, unsigned *lo) {
        asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
            : "=r" (*hi), "=r" (*lo)
            : /* No input */
            : "%edx", "%eax");
    }

    void start_counter() {
        access_counter(&cyc_hi, &cyc_lo);
    }

    double get_counter() {
        unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
        double result;
        access_counter(&ncyc_hi, &ncyc_lo);
        lo = ncyc_lo - cyc_lo;
        borrow = lo > ncyc_lo;
        hi = ncyc_hi - cyc_hi - borrow;
        result = (double) hi * (1 << 30) * 4 + lo;
        return result;
    }
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
    int main(void)
    {
        double c1, c2;
        start_counter();
        c1 = get_counter();
        sleep(1);
        c2 = get_counter();
        printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
        printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
        return 0;
    }
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it works to some extent: the result is bigger than 0, but it varies because the CPU frequency is not fixed (we tried to fix it, but for some reason we were not able to). He has a different machine running Linux (Ubuntu) as host, and it also reports 0. This rules out the virtual machine as the cause, which is what I suspected at first.
Any ideas why this is happening and how I can fix it?
Comments (5)
Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like: rdtsc, then the call to sleep, then rdtsc again.
Modern CPUs do not necessarily execute instructions in their original order, though. Despite your original order, the CPU is (mostly) free to execute that as if both rdtsc instructions came first and the call to sleep came after them.
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
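The idea of putting CPUID in front of rdtsc can be sketched roughly as follows. This is only a minimal illustration, assuming GCC or Clang on x86-64; the helper name serialized_rdtsc is hypothetical and does not come from this answer or the linked one. Both asm statements are marked volatile so the compiler neither drops nor merges them.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original answer): run CPUID as a
 * serializing instruction, then read the time-stamp counter. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /* CPUID leaf 0. Its results are discarded; only the serializing
     * effect matters here. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");

    /* RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));

    (void)ebx;
    (void)edx;
    return ((uint64_t)hi << 32) | lo;
}
```

Calling serialized_rdtsc() before and after the code being timed and subtracting the two values gives an elapsed tick count. In a virtual machine CPUID typically traps to the hypervisor, so each call may be noticeably more expensive there.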
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit math carries yourself, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm; you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.
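A minimal sketch of what such a simplified reader might look like, assuming GCC or Clang on x86. The names read_tsc and print_tsc_delta are hypothetical, and this is not the answerer's own listing. It binds the outputs directly to eax and edx with the "=a" and "=d" constraints, combines them with 64-bit integer arithmetic, and prints the raw readings as suggested above. The volatile qualifier is an addition of this sketch, so that two reads are not merged into one by the compiler.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read the 64-bit time-stamp counter.
 * "=a" and "=d" tell the compiler the results arrive in EAX and EDX. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: print both raw readings and their difference,
 * so it is visible whether the counter itself reads as 0. */
static uint64_t print_tsc_delta(uint64_t start, uint64_t end)
{
    uint64_t delta = end - start;
    printf("start=%llu end=%llu delta=%llu\n",
           (unsigned long long)start,
           (unsigned long long)end,
           (unsigned long long)delta);
    return delta;
}
```

A typical use would be to call read_tsc() once before sleep(1) and once after, then pass both values to print_tsc_delta().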
As for VMware, take a look at the timekeeping spec (PDF link), as well as this thread. TSC instructions are (depending on the guest OS):
Note the qualifier in #2: while the VM is executing on the host processor. The same phenomenon would apply to Xen as well, if I recall correctly. In essence, you can expect the code to work as intended on a paravirtualized guest. If the TSC is emulated, it's entirely unreasonable to expect hardware-like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and @Mysticial's answer there has working GNU C inline asm, which I'll quote here:
This works correctly and efficiently for 32 and 64-bit code.
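As a rough illustration of the intrinsic route, here is a sketch that estimates the TSC rate over a short sleep. It assumes GCC or Clang on x86 with a POSIX system; estimate_tsc_hz is a hypothetical name and this is not the quoted code. Because __rdtsc() is treated as having side effects, the two reads are not folded together. On many modern CPUs the TSC ticks at a fixed reference rate, so the figure may differ from the current core clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */

/* Hypothetical helper: estimate TSC ticks per second by sleeping for
 * about `ms` milliseconds and dividing the tick delta by the elapsed
 * time measured with CLOCK_MONOTONIC. */
static double estimate_tsc_hz(long ms)
{
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t start = __rdtsc();
    nanosleep(&req, NULL);
    uint64_t end = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)(end - start) / seconds;
}
```

For example, printf("%.1f MHz\n", estimate_tsc_hz(1000) / 1e6) would report the estimate over roughly one second.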
Hmm, I'm not positive, but I suspect the problem may be in this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm not sure you can safely carry out such huge multiplications in an "unsigned". Isn't that often a 32-bit number? The fact that you couldn't safely multiply by 2^32 directly, and had to split it into 2^30 with an extra "* 4" at the end, already hints at this possibility. You might need to convert each sub-component, hi and lo, to a double (instead of doing a single conversion at the very end) and do the multiplication using the two doubles.
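One way to test this suspicion is to compare the expression against a plain 64-bit integer computation. The sketch below uses hypothetical helper names and assumes a platform where unsigned is 32 bits wide. As far as I can tell, the cast applies to hi before any multiplication happens, so the whole expression is evaluated in double and matches the integer result for values below 2^53. That suggests this line is unlikely to be what produces 0.

```c
#include <stdint.h>

/* The expression from the question, unchanged. The cast binds to hi
 * first, so every following operation is carried out in double. */
static double combine_as_double(unsigned hi, unsigned lo)
{
    return (double) hi * (1 << 30) * 4 + lo;
}

/* Reference computation using 64-bit integer arithmetic. */
static uint64_t combine_as_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Comparing combine_as_double(hi, lo) with (double)combine_as_u64(hi, lo) for a few sample values is a quick way to confirm the two agree.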