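How do I get a CPU cycle count in Win32?

Posted on 2024-07-05 20:20:01

In Win32, is there any way to get a unique CPU cycle count, or something similar, that is uniform across multiple processes, languages, systems, etc.?

I'm creating some log files, but I have to produce multiple log files because we're hosting the .NET runtime, and I'd like to avoid calling from one side into the other just to log. So I was thinking I'd produce two files, combine them, and then sort them to get a coherent timeline that includes the cross-world calls.

However, GetTickCount does not increase on every call, so it is not reliable for this. Is there a better number, so that the calls end up in the right order when sorting?

Edit: Thanks to @Greg, who put me on track to QueryPerformanceCounter, which did the trick.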
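
The combine-and-sort step described in the question could look roughly like the sketch below. It assumes each log line starts with an integer counter reading followed by a space; that line format, the sample lines, and the names `parse_stamp` and `merge_logs` are illustrative assumptions, not something given in the question.

```python
import heapq

def parse_stamp(line: str) -> int:
    """Return the leading integer timestamp of a line shaped like '<ticks> <message>'."""
    return int(line.split(" ", 1)[0])

def merge_logs(native_lines, managed_lines):
    """Merge two logs that are each already in timestamp order.

    heapq.merge is a streaming merge, so neither log has to be re-sorted.
    On equal timestamps, lines from the first argument come first.
    """
    return list(heapq.merge(native_lines, managed_lines, key=parse_stamp))

# Hypothetical sample data: one log per "world".
native = ["100 native: calling into managed code", "250 native: call returned"]
managed = ["120 managed: entered callback", "200 managed: leaving callback"]

for line in merge_logs(native, managed):
    print(line)
```

This only yields a meaningful timeline if both logs are stamped from the same counter on the same machine.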

ま昔日黯然 2024-07-12 20:20:01

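You can use the RDTSC CPU instruction (assuming x86). This instruction gives you the CPU cycle counter, but be aware that it will increase very quickly to its maximum value and then reset to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.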

你的背包 2024-07-12 20:20:01

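System.Diagnostics.Stopwatch.GetTimestamp() returns the number of CPU cycles since a time origin (maybe when the computer starts, but I'm not sure), and I've never seen it fail to increase between two calls.

CPU cycle counts are specific to each computer, so you can't use them to merge log files between two computers.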
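
Raw counter readings like these only become durations once they are divided by the counter's frequency (`Stopwatch.Frequency` in .NET, `QueryPerformanceFrequency` in Win32). A small sketch of that conversion follows; the function name, the sample readings, and the 10 MHz frequency are assumptions chosen for illustration.

```python
def ticks_to_microseconds(ticks: int, frequency_hz: int) -> int:
    """Convert a raw counter difference to whole microseconds.

    Multiplying before dividing avoids losing precision; Python integers
    do not overflow, and the result is truncated for non-negative inputs.
    """
    return ticks * 1_000_000 // frequency_hz

start, end = 5_000_000, 5_012_345   # two hypothetical counter readings
frequency = 10_000_000              # assumed counter frequency: 10 MHz

print(ticks_to_microseconds(end - start, frequency))  # prints 1234
```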

回梦 2024-07-12 20:20:01

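RDTSC output may depend on the current core's clock frequency, which on modern CPUs is neither constant nor, on a multicore machine, consistent across cores.

Use the system time, and if you are dealing with feeds from multiple systems, use an NTP time source. That way you get reliable, consistent time readings. If the overhead is too much for your purposes, using the HPET to calculate the time elapsed since the last known reliable time reading is better than using the HPET alone.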

音盲 2024-07-12 20:20:01

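Use GetTickCount and add another counter as you merge the log files. It won't give you a perfect sequence between the different log files, but it will at least keep all the entries from each file in the correct order.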
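
A sketch of that idea, assuming each log entry is a `(tick, message)` pair. The entry shape, the source labels, and the function names are hypothetical. The per-file sequence number is what keeps entries with equal tick values in their original order.

```python
from itertools import count

def tag_with_sequence(entries, source):
    """Attach a source label and a per-file sequence number to each (tick, message) entry."""
    seq = count()
    return [(tick, source, next(seq), message) for tick, message in entries]

def merge_by_tick(file_a, file_b):
    """Order by tick; ties fall back to the source label, then the per-file sequence."""
    tagged = tag_with_sequence(file_a, "a") + tag_with_sequence(file_b, "b")
    tagged.sort(key=lambda entry: entry[:3])
    return [message for _tick, _source, _seq, message in tagged]

# Hypothetical entries; several share a tick value because the tick counter is coarse.
file_a = [(16, "a1"), (16, "a2"), (31, "a3")]
file_b = [(16, "b1"), (31, "b2")]

print(merge_by_tick(file_a, file_b))  # ['a1', 'a2', 'b1', 'a3', 'b2']
```

Entries from different files that share a tick are ordered arbitrarily (here by source label), which matches the caveat in the answer.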

闻呓 2024-07-12 20:20:01

Use RDTSC.

On hardware with an invariant TSC, procedures like QueryPerformanceCounter are typically just calling RDTSC under the hood.

RDTSC has not been dependent on dynamic frequency changes for about 16 years. From The Intel 64 and IA-32 Architectures Software Developer’s Manual (Vol. 3B):

18.17.1 Invariant TSC
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.

RDTSCP is a related instruction that also provides the processor ID relevant to the timestamp that was read:

18.17.2 IA32_TSC_AUX Register and RDTSCP Support
Processors based on Nehalem microarchitecture provide an auxiliary TSC register, IA32_TSC_AUX that is designed to be used in conjunction with IA32_TSC. IA32_TSC_AUX provides a 32-bit field that is initialized by privileged software with a signature value (for example, a logical processor ID).
The primary usage of IA32_TSC_AUX in conjunction with IA32_TSC is to allow software to read the 64-bit time stamp in IA32_TSC and signature value in IA32_TSC_AUX with the instruction RDTSCP in an atomic operation. RDTSCP returns the 64-bit time stamp in EDX:EAX and the 32-bit TSC_AUX signature value in ECX. The atomicity of RDTSCP ensures that no context switch can occur between the reads of the TSC and TSC_AUX values.
Support for RDTSCP is indicated by CPUID.80000001H:EDX[27]. As with RDTSC instruction, non-ring 0 access is controlled by CR4.TSD (Time Stamp Disable flag).
User mode software can use RDTSCP to detect if CPU migration has occurred between successive reads of the TSC. It can also be used to adjust for per-CPU differences in TSC values in a NUMA system.

Depending on what you're measuring, it may be appropriate to use fences in conjunction with calls to RDTSC:

The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed. The following items may guide software seeking to order executions of RDTSCP:
• If software requires RDTSCP to be executed only after all previous stores are globally visible, it can execute MFENCE immediately before RDTSCP.
• If software requires RDTSCP to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute LFENCE immediately after RDTSCP.

The above also applies to AMD processors:

The behavior of the RDTSC instruction is implementation dependent. The TSC counts at a constant rate, but may be affected by power management events (such as frequency changes), depending on the processor implementation. If CPUID Fn8000_0007_EDX[TscInvariant] = 1, then the TSC rate is ensured to be invariant across all P-States, C-States, and stop-grant transitions (such as STPCLK Throttling); therefore, the TSC is suitable for use as a source of time. Consult the BIOS and Kernel Developer’s Guide applicable to your product for information concerning the effect of power management on the TSC.

AMD processors have had an invariant TSC since at least as far back as Bulldozer. From The BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h Models 30h-3Fh Processors:

8 | TscInvariant: TSC invariant. Value: 1. The TSC rate is invariant.
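
The feature bits quoted above can be tested with a simple mask once the CPUID register value is in hand. The sketch below only decodes an EDX value passed in as an integer. Obtaining that value (for example with a compiler's CPUID intrinsic in native code) is outside this snippet, and the function names and sample values are illustrative assumptions.

```python
INVARIANT_TSC_BIT = 8   # CPUID.80000007H:EDX[8], per the Intel and AMD excerpts
RDTSCP_BIT = 27         # CPUID.80000001H:EDX[27], per the Intel excerpt

def bit_is_set(register_value: int, bit: int) -> bool:
    """Return True if the given bit of a 32-bit register value is 1."""
    return (register_value >> bit) & 1 == 1

def has_invariant_tsc(edx_leaf_80000007: int) -> bool:
    """Decode the invariant-TSC flag from EDX of CPUID leaf 80000007H."""
    return bit_is_set(edx_leaf_80000007, INVARIANT_TSC_BIT)

def has_rdtscp(edx_leaf_80000001: int) -> bool:
    """Decode the RDTSCP-support flag from EDX of CPUID leaf 80000001H."""
    return bit_is_set(edx_leaf_80000001, RDTSCP_BIT)

# Hypothetical register values, not read from real hardware.
print(has_invariant_tsc(0x0000_0100))  # True: bit 8 is set
print(has_rdtscp(0x0000_0000))         # False: bit 27 is clear
```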

References:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf

https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/design-guides/25481.pdf

请别遗忘我 2024-07-12 20:20:01

Here's an interesting article that says not to use RDTSC, but to use QueryPerformanceCounter (http://msdn.microsoft.com/en-us/library/windows/desktop/ms644904) instead.

Conclusion:

Using regular old timeGetTime() to do timing is not reliable on many Windows-based operating systems because the granularity of the system timer can be as high as 10-15 milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds. [Note that the high granularities occur on NT-based operating systems like Windows NT, 2000, and XP. Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]

However, if you call timeBeginPeriod(1) at the beginning of your program (and timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 milliseconds, and will provide you with extremely accurate timing information.

Sleep() behaves similarly; the length of time that Sleep() actually sleeps for goes hand-in-hand with the granularity of timeGetTime(), so after calling timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds, Sleep(2) for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).

For higher precision timing (sub-millisecond accuracy), you'll probably want to avoid using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less than 10 microseconds (0.00001 seconds).

For simple timing, both timeGetTime and QueryPerformanceCounter work well, and QueryPerformanceCounter is obviously more accurate. However, if you need to do any kind of "timed pauses" (such as those necessary for framerate limiting), you need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting for it to reach a certain value; this will eat up 100% of your processor. Instead, consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) first!) whenever you need to pass more than 1 ms of time, and then only enter the QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a second of the delay you need. This will give you ultra-accurate delays (accurate to 10 microseconds), with very minimal CPU usage. See the code above.
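
The article's own code is not reproduced in this thread, so here is an independent sketch of the hybrid sleep-then-spin scheme it describes. It uses `time.perf_counter()`, which CPython implements with QueryPerformanceCounter on Windows. The function name and the 2 ms spin threshold are assumptions. The sketch does not call timeBeginPeriod(1), so how long each 1 ms sleep really takes depends on the platform.

```python
import time

def hybrid_wait(delay_s: float, spin_threshold_s: float = 0.002) -> float:
    """Wait at least delay_s seconds and return the elapsed time measured.

    While more than spin_threshold_s remains, yield the CPU with short
    sleeps; for the final stretch, busy-wait on the high-resolution counter.
    """
    start = time.perf_counter()
    elapsed = 0.0
    while elapsed < delay_s:
        if delay_s - elapsed > spin_threshold_s:
            time.sleep(0.001)  # coarse phase: cheap, but imprecise
        # When the branch above is skipped, this loop is the busy-wait phase.
        elapsed = time.perf_counter() - start
    return elapsed

waited = hybrid_wait(0.010)
print(f"asked for 10.000 ms, waited {waited * 1000:.3f} ms")
```

The threshold trades CPU usage against overshoot: a larger value spins longer but is less exposed to a sleep that runs late.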