Measuring the time to execute a single instruction

Posted 2024-08-29 09:50:48

Is there a way, using C, assembly, or even C#, to accurately measure how long an ADD instruction takes to execute?

西瓜 2024-09-05 09:50:48

Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.

On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.

On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.

That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:

.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
ADD eax, ebx
; end of sequence under test
CPUID
RDTSC

Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a few details, of course -- at minimum you need to:

set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract the result of the first RDTSC from the result of the second

Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
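
The sequence above, with the three missing details filled in, might look roughly like the following C sketch. It assumes GCC or Clang inline assembly on x86; the names serialize, read_ticks, time_empty, time_add and estimate_add_ticks are made up for illustration, and taking the minimum over many runs is an extra assumption that the answer itself does not prescribe. On other architectures the block only compiles with stand-ins and no longer demonstrates the technique.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* CPUID leaf 0, executed only for its serializing side effect.
   EAX and ECX are set up first; all four registers are clobbered. */
static inline void serialize(void) {
    uint32_t a = 0, b, c = 0, d;
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                         :
                         : "memory");
    (void)b; (void)d;
}

/* RDTSC returns the time-stamp counter in EDX:EAX; combine into 64 bits. */
static inline uint64_t read_ticks(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* The sequence under test: a single register-to-register ADD. */
#define ADD_UNDER_TEST(x, y) \
    __asm__ __volatile__("addl %1, %0" : "+r"(x) : "r"(y) : "cc")
#else
/* Assumption: non-x86 stand-ins so the sketch still builds. */
#include <time.h>
static inline void serialize(void) { __asm__ __volatile__("" ::: "memory"); }
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#define ADD_UNDER_TEST(x, y) \
    do { (x) += (y); __asm__ __volatile__("" : "+r"(x)); } while (0)
#endif

/* Baseline: the same sequence with the instruction under test removed. */
static uint64_t time_empty(void) {
    serialize(); serialize(); serialize();  /* three warm-up CPUIDs */
    uint64_t start = read_ticks();          /* first timestamp is saved */
    serialize();
    uint64_t end = read_ticks();
    return end - start;                     /* second minus first */
}

/* The same sequence with one ADD between the two timestamp reads. */
static uint64_t time_add(void) {
    uint32_t a = 1, b = 2;
    serialize(); serialize(); serialize();
    uint64_t start = read_ticks();
    ADD_UNDER_TEST(a, b);
    serialize();
    uint64_t end = read_ticks();
    return end - start;
}

/* Net estimate in ticks: best run with the ADD minus best empty run.
   Clamped at zero because the difference is often lost in the noise. */
static uint64_t estimate_add_ticks(int runs) {
    assert(runs > 0);
    uint64_t best_empty = UINT64_MAX, best_add = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t e = time_empty();
        uint64_t t = time_add();
        if (e < best_empty) best_empty = e;
        if (t < best_add) best_add = t;
    }
    return best_add > best_empty ? best_add - best_empty : 0;
}
```

Calling estimate_add_ticks(1000) returns a tick count that is frequently zero or close to it, which is consistent with the point made earlier: the cost of one ADD is largely hidden by the instructions around it.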

旧时模样 2024-09-05 09:50:48

Construct a loop that executes 10 million times with nothing in the loop body, and time it. Keep that time as the overhead required for looping.

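Then execute the same loop again, this time with the code under test in the body. The time for this loop, minus the overhead (from the empty-loop case), is the time taken by 10 million repetitions of the code under test. So, divide by the number of iterations to get the time per repetition.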

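Obviously, the iteration count needs tuning. If what you are measuring is small, such as a single instruction, you might even want to run upwards of a billion iterations. If it is a significant block of code, a few tens to a few thousand iterations may suffice.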
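
A minimal C sketch of this subtraction method follows. It assumes GCC or Clang (the empty asm statements are compiler barriers that emit no instructions and only stop the optimizer from deleting the loops) and uses the standard clock() function for timing; the names KEEP, time_empty_loop, time_add_loop and add_cost_ns are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Compiler barrier (GCC/Clang syntax): forces v to live in a register and
   be treated as modified, so the add cannot be folded away. */
#define KEEP(v) __asm__ __volatile__("" : "+r"(v))

/* CPU seconds for `iters` passes of a loop with nothing in the body. */
static double time_empty_loop(uint64_t iters) {
    clock_t start = clock();
    for (uint64_t i = 0; i < iters; i++) {
        __asm__ __volatile__("");  /* keeps the empty loop from being removed */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* CPU seconds for the same loop with one add per pass. */
static double time_add_loop(uint64_t iters) {
    uint32_t acc = 0;
    clock_t start = clock();
    for (uint64_t i = 0; i < iters; i++) {
        acc += 3;                  /* code under test */
        KEEP(acc);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Estimated nanoseconds per add: (loop with add - empty loop) / iterations.
   The result can be zero or slightly negative when the add is hidden by
   the loop overhead or by measurement noise. */
static double add_cost_ns(uint64_t iters) {
    assert(iters > 0);
    double overhead = time_empty_loop(iters);
    double total = time_add_loop(iters);
    return (total - overhead) * 1e9 / (double)iters;
}
```

For example, add_cost_ns(1000000000ULL) runs a billion iterations of each loop; smaller counts finish faster but give noisier estimates.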

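For a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C if you are conversant with inline assembly. Others have posted more elegant solutions for getting a measurement without repetition, but the repetition technique is always available, for example on an embedded processor that lacks the nice timing instructions others have mentioned.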

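Note, however, that on modern pipelined processors, instruction-level parallelism may confound your results. Because more than one instruction is in flight in the execution pipeline at a time, it is no longer true that N repetitions of a given instruction take N times as long as a single one.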
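
One way to see this effect is to time the same number of adds arranged two ways: as one dependency chain, and as four independent chains. The sketch below assumes GCC or Clang and an optimized build; the names KEEP, time_dependent_adds and time_independent_adds are hypothetical. On a superscalar CPU the independent version would be expected to finish sooner, but how large the gap is depends on the processor and compiler, so no particular numbers are claimed here.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Compiler barrier (GCC/Clang syntax): keeps each add from being merged
   with its neighbours by the optimizer. Emits no instructions. */
#define KEEP(v) __asm__ __volatile__("" : "+r"(v))

/* 4 * iters adds in one chain: every add needs the previous result. */
static double time_dependent_adds(uint64_t iters) {
    assert(iters > 0);
    uint32_t a = 0;
    clock_t start = clock();
    for (uint64_t i = 0; i < iters; i++) {
        a += 1; KEEP(a);
        a += 1; KEEP(a);
        a += 1; KEEP(a);
        a += 1; KEEP(a);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* 4 * iters adds spread over four accumulators: no add depends on the
   other three in the same pass, so the CPU may overlap them. */
static double time_independent_adds(uint64_t iters) {
    assert(iters > 0);
    uint32_t a = 0, b = 0, c = 0, d = 0;
    clock_t start = clock();
    for (uint64_t i = 0; i < iters; i++) {
        a += 1; KEEP(a);
        b += 1; KEEP(b);
        c += 1; KEEP(c);
        d += 1; KEEP(d);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Both functions execute exactly the same number of add operations, so any difference between the two timings comes from how the instructions depend on each other rather than from the instruction itself.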

若水微香 2024-09-05 09:50:48

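Well, the problem you will run into if you are using an operating system such as Windows, Linux, Unix, MacOS, AmigaOS, or any of the others is that plenty of processes are already running in the background on your machine, and they affect performance. The only real way to measure the actual time of an instruction is to take your motherboard apart and test each component with external hardware. It depends on whether you absolutely want to do this yourself or simply want to find out how fast a typical version of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and those results are available to the public. All you need to do is ask them, and they will send you a free CD-ROM (or possibly a DVD, to be pedantic) containing the results. You can do it yourself, but be warned that Intel processors in particular contain many redundant instructions that are no longer desirable, let alone necessary. It will take up a lot of your time, but I can definitely see the fun in doing it. P.S. If the goal is purely to push the hardware of your own machine toward its theoretical maximum in a personal project, Jeff's answer above is excellent for producing tidy averages of instruction speed under real-world conditions.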
