Measuring the time to execute a single instruction

Posted 2024-08-29 09:50:48

Is there a way, using C or assembler or maybe even C#, to get an accurate measure of how long it takes to execute an ADD instruction?

Comments (4)

西瓜 2024-09-05 09:50:48

Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.

On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.

On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
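
To illustrate how the surrounding instructions change the apparent cost of an ADD, here is a minimal C sketch (not part of the original answer). It times the same number of additions arranged two ways: as a single chain in which each ADD needs the previous result, and as four independent accumulators. The names `cpu_seconds`, `KEEP`, `time_dependent_adds` and `time_independent_adds` are made up for illustration, the empty `__asm__` barriers assume GCC or Clang, and an optimized build is assumed so that the variables stay in registers.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Processor time used so far, in seconds (standard C clock()). */
static double cpu_seconds(void) {
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Empty asm that claims to modify v: the compiler must produce v in a
 * register here and cannot fold additions across this point.
 * GCC/Clang extension; this is an assumption, not portable C. */
#define KEEP(v) __asm__ volatile("" : "+r"(v))

/* Four additions per iteration, each depending on the one before it. */
static double time_dependent_adds(uint64_t iterations) {
    assert(iterations > 0);
    uint32_t a = 1, y = 5;
    KEEP(y);
    double start = cpu_seconds();
    for (uint64_t i = 0; i < iterations; i++) {
        a += y; KEEP(a);
        a += y; KEEP(a);
        a += y; KEEP(a);
        a += y; KEEP(a);
    }
    return cpu_seconds() - start;
}

/* Four additions per iteration on four separate accumulators, so no
 * addition within an iteration depends on another. */
static double time_independent_adds(uint64_t iterations) {
    assert(iterations > 0);
    uint32_t a = 1, b = 2, c = 3, d = 4, y = 5;
    KEEP(y);
    double start = cpu_seconds();
    for (uint64_t i = 0; i < iterations; i++) {
        a += y; b += y; c += y; d += y;
        KEEP(a); KEEP(b); KEEP(c); KEEP(d);
    }
    return cpu_seconds() - start;
}
```

Both functions perform four additions per iteration, so any difference between the two results comes from how the additions relate to each other rather than from how many there are. On many out-of-order cores the independent version tends to finish sooner, but the size of the gap depends on the processor and on the code the compiler actually emits, so it is worth checking the disassembly.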

That said, yes, if you really want to, you can (usually, depending on the processor) measure something, though exactly how much it really means is open to considerable question. Even getting a result that's only close to meaningless, rather than completely meaningless, isn't trivial. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. Unfortunately, RDTSC can itself be executed out of order, as described above. To get meaningful results, you need to bracket it with an instruction that can't be executed out of order (a "serializing instruction"). The most common choice is CPUID, since it's one of the few serializing instructions available to "user mode" (i.e., ring 3) programs. That adds a twist of its own, though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than on subsequent executions. As such, they recommend executing it three times before you use it to serialize your timing. The general sequence therefore runs something like this:

.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
Add eax, ebx
; end of sequence under test
CPUID
RDTSC

Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a few details, of course. At minimum you need to:

  1. set the registers up correctly before each CPUID
  2. save the value in EDX:EAX after the first RDTSC
  3. subtract the result of the first RDTSC from the result of the second

Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
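
The sequence and the three follow-up steps can be sketched in C with GCC/Clang inline assembly. This is a rough illustration under stated assumptions, not the answerer's code: the names `serialize`, `read_ticks`, `ADD_UNDER_TEST`, `timed_run_with_add`, `timed_run_empty`, `min_ticks` and `estimate_add_ticks` are hypothetical, taking the minimum over many runs is an added assumption to reduce noise from interrupts, and the non-x86 branch is only a fallback so the file still builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* CPUID is a serializing instruction; its register outputs are discarded. */
static inline void serialize(void) {
    uint32_t a = 0, b, c = 0, d;
    __asm__ volatile("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d) : : "memory");
    (void)a; (void)b; (void)c; (void)d;
}
/* RDTSC returns the time-stamp counter with the high half in EDX, low in EAX. */
static inline uint64_t read_ticks(void) {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
/* The sequence under test: one register-to-register ADD. */
#define ADD_UNDER_TEST(x, y) \
    __asm__ volatile("addl %1, %0" : "+r"(x) : "r"(y) : "cc")
#else
#include <time.h>
/* Assumed fallback for non-x86 targets: a compiler barrier only (no
 * hardware serialization) and a nanosecond wall clock instead of a TSC. */
static inline void serialize(void) { __asm__ volatile("" : : : "memory"); }
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#define ADD_UNDER_TEST(x, y) \
    do { __asm__ volatile("" : "+r"(x) : "r"(y)); (x) += (y); \
         __asm__ volatile("" : "+r"(x)); } while (0)
#endif

/* serialize, read counter, sequence under test, serialize, read counter. */
static uint64_t timed_run_with_add(void) {
    uint32_t x = 1, y = 2;
    serialize();
    uint64_t start = read_ticks();
    ADD_UNDER_TEST(x, y);
    serialize();
    uint64_t end = read_ticks();
    return end - start;            /* second reading minus the first */
}

/* The same sequence with the code under test removed. */
static uint64_t timed_run_empty(void) {
    serialize();
    uint64_t start = read_ticks();
    serialize();
    uint64_t end = read_ticks();
    return end - start;
}

/* Smallest of `runs` measurements, after three warm-up serializations. */
static uint64_t min_ticks(uint64_t (*run)(void), int runs) {
    assert(runs > 0);
    serialize(); serialize(); serialize();
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t = run();
        if (t < best) best = t;
    }
    return best;
}

/* Ticks attributed to the ADD: timed run minus empty run. Can be zero
 * or negative, since one ADD is far below the measurement noise. */
static int64_t estimate_add_ticks(int runs) {
    return (int64_t)min_ticks(timed_run_with_add, runs)
         - (int64_t)min_ticks(timed_run_empty, runs);
}
```

Calling `estimate_add_ticks(1000)` gives the difference in counter ticks between the two variants. A result near zero is quite plausible, which fits the point made above about how little a single-instruction timing means on an out-of-order processor.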

旧时模样 2024-09-05 09:50:48

Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.

Then execute the same loop again, this time with the code under test in the body. The time for this loop, minus the overhead (from the empty-loop case), is the time due to the 10 million repetitions of your code under test. Divide that by the number of iterations to get the time for a single execution.

Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If it's a significant chunk of code, a few tens of thousands might suffice.

In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for getting a measurement without the repetition, but the repetition technique is always available, for example on an embedded processor that doesn't have the nice timing instructions mentioned by others.

Note, however, that on modern pipelined processors, instruction-level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.
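
A minimal C sketch of the two-loop technique follows, under a few assumptions: GCC or Clang (for the empty `__asm__` statements that stop the optimizer from deleting the loops), an optimized build, and the standard `clock()` function as the timer. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Processor time used so far, in seconds (standard C clock()). */
static double cpu_seconds(void) {
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Time a loop with nothing in the body: the looping overhead. */
static double time_empty_loop(uint64_t iterations) {
    double start = cpu_seconds();
    for (uint64_t i = 0; i < iterations; i++) {
        __asm__ volatile("");      /* keeps the compiler from removing the loop */
    }
    return cpu_seconds() - start;
}

/* Time the same loop with one addition in the body. */
static double time_add_loop(uint64_t iterations) {
    uint32_t x = 0, y = 3;
    double start = cpu_seconds();
    for (uint64_t i = 0; i < iterations; i++) {
        /* Hide x and y from the optimizer so the addition below is
         * really performed once per iteration. */
        __asm__ volatile("" : "+r"(x), "+r"(y));
        x += y;
    }
    __asm__ volatile("" : : "r"(x));   /* final result counts as used */
    return cpu_seconds() - start;
}

/* (time with body - time of empty loop) / number of iterations. */
static double seconds_per_add(uint64_t iterations) {
    assert(iterations > 0);
    double overhead = time_empty_loop(iterations);
    double total = time_add_loop(iterations);
    return (total - overhead) / (double)iterations;
}
```

For example, `seconds_per_add(10000000)` follows the 10-million-iteration recipe above. The result can come out as zero or slightly negative when the addition overlaps with the loop bookkeeping, which is the instruction-level parallelism effect just described.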

若水微香 2024-09-05 09:50:48

Okay, the problem you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS and all the others is that there are lots of processes already running on your machine in the background, which will impact performance. The only real way of calculating the actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply want to find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and these results are available to the public. All you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD; nonsense pedantry) with the results on it. You can do it yourself, but be warned that Intel processors especially contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing this.

PS. If it's purely to help push your own machine's hardware to its theoretical maximum in a personal project, then Just Jeff's answer above is excellent for generating tidy instruction-speed averages under real-world conditions.

亚希 2024-09-05 09:50:48

No, but you can calculate it from the number of clock cycles the ADD instruction requires, divided by the clock rate of the CPU (equivalently, multiplied by the clock period). Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
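
As a small worked version of that arithmetic, the conversion from a cycle count to a time might look like this in C. The function name `cycles_to_ns` and the example figures (one cycle, a 2 GHz clock) are hypothetical and not taken from any data sheet.

```c
#include <assert.h>

/* Time in nanoseconds taken by `cycles` clock cycles on a CPU running
 * at `clock_hz` hertz: time = cycles / frequency. */
static double cycles_to_ns(double cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return cycles / clock_hz * 1e9;
}
```

With these illustrative numbers, `cycles_to_ns(1.0, 2e9)` comes to about half a nanosecond.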

That said, why do you care?
