CPU-cycle-count-based profiling in C/C++ on Linux x86_64
I am using the following code to profile my operations to optimize on cpu cycles taken in my functions.
static __inline__ unsigned long GetCC(void)
{
unsigned a, d;
asm volatile("rdtsc" : "=a" (a), "=d" (d));
return ((unsigned long)a) | (((unsigned long)d) << 32);
}
I don't think it is the best, since even two consecutive calls give me a difference of "33".
Any suggestions?
I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:
As to the question of whether the time stamp counter is accurate, I would say that, assuming the tsc on different cores is synchronized (which is the norm), there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit that functionality when testing. If you're executing an instruction at 1 GHz or at 10 MHz on the same processor, the elapsed cycle count will be the same, even though the former completed in 1% of the time compared to the latter.
You are on the right track¹, but you need to do two things:

1. Run the cpuid instruction before rdtsc to flush the CPU pipeline (makes the measurement more reliable). As far as I recall, it clobbers registers from eax to edx.

2. Calibrate TSC ticks against real time, for example by taking differences between gettimeofday (Linux, since you didn't mention the platform) calls and rdtsc output. Then you can tell how much time each TSC tick takes. Another consideration is synchronization of the TSC across CPUs, because each core may have its own counter. In Linux you can see it in /proc/cpuinfo; your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.

¹ I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.
Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions, means that there can be considerable deviation in the number of cycles taken from call to call.
The right way is to measure the time (e.g. with clock()) taken for a large number of calls to the function, then average them.
By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.
Another thing you might need to worry about is that, if you are running on a multi-core machine, the program could be moved to a different core, which will have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.
If I were trying to measure something like this I would probably record the time stamps to an array and then come back and examine this array after the code being benchmarked had completed. When you are examining the data recorded to the array of timestamps you should keep in mind that this array will rely on the CPU cache (and possibly paging if your array is big), but you could prefetch or just keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, but with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events affected those measurements.
That being said, if the code you are benchmarking has irregular memory access patterns or run times or relies on system calls (especially IO related ones) then you will have a difficult time separating the noise from the data you are interested in.
The TSC isn't a good measure of time. The only guarantee that the CPU makes about the TSC is that it rises monotonically (that is, if you RDTSC once and then do it again, the second one will return a result that is higher than the first) and that it will take a very long time to wrap around.
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES

This Linux system call appears to be a cross-architecture wrapper for performance events.

This answer is basically the same as the one for this C++ question: How to get the CPU cycle count in x86_64 from C++? See that answer for more details.

perf_event_open.c
Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?
I'm sure you know another good way to do that is just loop the other code 10^6 times, stopwatch it, and call it microseconds.
Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?
If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.