CPU-cycle-count-based profiling in C/C++ on Linux x86_64

Posted 2024-09-25 17:10:15

I am using the following code to profile my operations to optimize on cpu cycles taken in my functions.

static __inline__ unsigned long GetCC(void)
{
  unsigned a, d;
  /* rdtsc leaves the 64-bit time-stamp counter split across edx:eax */
  asm volatile("rdtsc" : "=a" (a), "=d" (d));
  return ((unsigned long)a) | (((unsigned long)d) << 32);
}

I don't think it is the best, since even two consecutive calls give me a difference of "33".
Any suggestions ?

Comments (7)

如梦初醒的夏天 2024-10-02 17:10:15

I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:

  1. Since I use the Watcom compiler I've implemented rdtsc using "#pragma aux" which means that the C compiler will generate the instruction inline, expect the result in edx:eax and also inform its optimizer that the contents of eax and edx have been modified. This is a huge improvement from traditional _asm implementations where the optimizer would stay away from optimizing in _asm's vicinity. I've also implemented a divide_U8_by_U4 using "#pragma aux" so that I won't need to call a lib function when I convert clock_cycles to us or ms.
  2. Every execution of rdtsc will result in some overhead (a LOT more if it is encapsulated as in the author's example), which must be taken into account more the shorter the sequence to measure is. Generally I don't time sequences shorter than 1/30 of the internal clock frequency, which usually works out to 1/10^8 seconds (3 GHz internal clock). I use such measurements as indications, not fact. Knowing this I can leave out cpuid. The more times I measure, the closer to fact I will get.
  3. To measure reliably I would use the 1/100 - 1/300 range, i.e. 0.03 - 0.1 us. In this range the additional accuracy of using cpuid is practically insignificant. I use this range for short sequence timing. This is my "non-standard" unit since it is dependent on the CPU's internal clock frequency. For example, on a 1 GHz machine I would not use 0.03 us because that would put me outside the 1/100 limit and my readings would become indications. Here I would use 0.1 us as the shortest time measurement unit. 1/300 would not be used since it would be too close to 1 us (see below) to make any significant difference.
  4. For even longer processing sequences I divide the difference between two rdtsc readings by, say, 3000 (for 3 GHz) to convert the elapsed clock cycles to us. Actually I use (diff+1500)/3000, where 1500 is half of 3000, so the integer division rounds to nearest. For I/O waits I use milliseconds => (diff+1500000)/3000000. These are my "standard" units. I very seldom use seconds. (A sketch of this conversion follows after this list.)
  5. Sometimes I get unexpectedly slow results and then I must ask myself: is this due to an interrupt or to the code? I measure a few more times to see if it was, indeed, an interrupt. In that case ... well interrupts happen all the time in the real world. If my sequence is short then there is a good possibility that the next measurement won't be interrupted. If the sequence is longer interrupts will occur more often and there isn't much I can do about it.
  6. Measuring long elapsed times very accurately (hour-long and longer ETs expressed in us or lower) will increase the risk of getting a division exception in divide_U8_by_U4, so I think through when to use us and when to use ms.
  7. I also have code for basic statistics. Using this I log min and max values and I can calculate mean and standard deviation. This code is non-trivial so its own ET must be subtracted from the measured ETs.
  8. If the compiler is doing extensive optimizations and your readings are stored in local variables the compiler may determine ("correctly") that the code can be omitted. One way to avoid this is to store the results in public (non-static, non-stack-based) variables.
  9. Programs running in real-world conditions should be measured in real-world conditions, there's no way around that.
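
A minimal sketch of the rounding conversion from point 4, assuming a fixed 3 GHz time-stamp counter (the constant is illustrative; substitute your own measured TSC rate):

/* Convert a TSC cycle difference to microseconds, rounding to
   nearest by adding half the divisor before integer division. */
#define CYCLES_PER_US 3000ULL               /* assumed 3 GHz TSC */

static inline unsigned long long cycles_to_us(unsigned long long diff)
{
    return (diff + CYCLES_PER_US / 2) / CYCLES_PER_US;
}

/* Milliseconds, for I/O waits: (diff + 1500000) / 3000000. */
static inline unsigned long long cycles_to_ms(unsigned long long diff)
{
    return (diff + 1500000ULL) / 3000000ULL;
}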

As to the question of the time stamp counter being accurate, I would say that, assuming the tsc on different cores is synchronized (which is the norm), there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit that functionality when testing. If you're executing an instruction at 1 GHz or at 10 MHz on the same processor, the elapsed cycle count will be the same even though the former completed in 1% of the time compared to the latter.

简美 2024-10-02 17:10:15

You are on the right track¹, but you need to do two things:

  1. Run the cpuid instruction before rdtsc to flush the CPU pipeline (this makes the measurement more reliable). As far as I recall, it clobbers registers eax through edx.
  2. Measure real time. There is a lot more to execution time than just CPU cycles (locking contention, context switches and other overhead you don't control). Calibrate TSC ticks against real time. You can do it in a simple loop that takes differences between measurements of, say, gettimeofday (Linux, since you didn't mention the platform) calls and rdtsc output. Then you can tell how much time each TSC tick takes (a calibration sketch follows below). Another consideration is synchronization of the TSC across CPUs, because each core may have its own counter. In Linux you can see it in /proc/cpuinfo; your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.

¹ I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.
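
A minimal calibration sketch along these lines, assuming Linux and a constant-rate TSC; the 100 ms sleep interval is an arbitrary choice:

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static inline unsigned long long rdtsc(void)
{
    unsigned lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    struct timeval tv0, tv1;

    gettimeofday(&tv0, NULL);
    unsigned long long c0 = rdtsc();
    usleep(100 * 1000);                      /* wait ~100 ms */
    unsigned long long c1 = rdtsc();
    gettimeofday(&tv1, NULL);

    double us = (tv1.tv_sec - tv0.tv_sec) * 1e6
              + (tv1.tv_usec - tv0.tv_usec);
    printf("~%.1f TSC ticks per microsecond\n", (c1 - c0) / us);
    return 0;
}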

指尖上得阳光 2024-10-02 17:10:15

Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions means that there can be considerable deviation in the number of cycles taken from call to call.

The right way is either:

  • Count the number of cycles or CPU time (with clock()) taken for a large number of calls to the function, then average them; or
  • Use a cycle-level emulating profiler like Callgrind / kcachegrind.

By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.
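
A sketch of such a serialised read, using GCC-style inline assembly (this mirrors the common cpuid + rdtsc pairing; rdtscp is another option on newer CPUs):

static inline unsigned long long rdtsc_serialized(void)
{
    unsigned lo, hi;
    /* cpuid serialises the pipeline so that preceding instructions
       retire before rdtsc samples the counter; cpuid clobbers
       eax, ebx, ecx and edx, hence the extra clobber list. */
    __asm__ __volatile__(
        "cpuid\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        : "a"(0)
        : "ebx", "ecx");
    return ((unsigned long long)hi << 32) | lo;
}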

拥抱我好吗 2024-10-02 17:10:15

Another thing you might need to worry about is that, if you are running on a multi-core machine, the program could be moved to a different core, which will have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.

If I were trying to measure something like this I would probably record the time stamps to an array and then come back and examine this array after the code being benchmarked had completed. When you are examining the data recorded to the array of timestamps you should keep in mind that this array will rely on the CPU cache (and possibly paging if your array is big), but you could prefetch or just keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, but with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events affected those measurements.

That being said, if the code you are benchmarking has irregular memory access patterns or run times or relies on system calls (especially IO related ones) then you will have a difficult time separating the noise from the data you are interested in.
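
On Linux, the pinning mentioned above can be done with sched_setaffinity; a minimal sketch, pinning the calling process to logical CPU 0 (an arbitrary choice):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* allow only logical CPU 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}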

雪落纷纷 2024-10-02 17:10:15

The TSC isn't a good measure of time. The only guarantee the CPU makes about the TSC is that it rises monotonically (that is, if you RDTSC once and then do it again, the second read will return a result that is higher than the first) and that it will take a very long time to wrap around.
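
For a sense of scale: a 64-bit counter incrementing at 3 GHz wraps after 2^64 / (3 × 10^9) ≈ 6.1 × 10^9 seconds, roughly 195 years.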

很酷又爱笑 2024-10-02 17:10:15

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES

This Linux system call appears to be a cross-architecture wrapper for performance events.

This answer is basically the same as the one for this C++ question: How to get the CPU cycle count in x86_64 from C++? See that answer for more details.

perf_event_open.c

#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("%lld\n", count);

    close(fd);
}
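
For reference, on a typical GNU/Linux setup this can be built and run with something like gcc perf_event_open.c -o perf_event_open && ./perf_event_open 100000; the reported cycle count should come out close to the iteration count, since the sub/jne loop retires at roughly one iteration per cycle on modern cores.
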
‘画卷フ 2024-10-02 17:10:15

Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?

I'm sure you know another good way to do that is just loop the other code 10^6 times, stopwatch it, and call it microseconds.
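
A minimal version of that loop-and-stopwatch approach, where work() is a hypothetical stand-in for the code being measured:

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in: replace with the code under test. */
static void work(void)
{
    __asm__ __volatile__("" ::: "memory");   /* keep the loop alive */
}

int main(void)
{
    enum { N = 1000000 };                    /* 10^6 iterations */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        work();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("%.3f us per call\n", ns / N / 1000.0);
    return 0;
}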

Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?

If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.
