I am looking for information that can help in estimating interrupt latencies on x86 CPUs. A very useful paper can be found at "datasheets.chipdb.org/Intel/x86/386/technote/2153.pdf". But this paper raised a very important question for me: how can the delay caused by waiting for the current instruction to complete be determined? I mean the delay between recognition of the INTR signal and execution of the INTR micro-code. As I remember, the Intel Software Developer's Manual also says something about waiting for the currently executing instruction to complete, but it also says that some instructions can be interrupted in progress. The main question is: how can the maximum instruction-completion waiting time be determined for a particular processor? An estimate in core ticks and memory access operations is needed, not in seconds or microseconds. Cache and TLB misses, and other factors that can influence this waiting time, should be considered.
This estimate is needed to investigate the possibility of implementing small critical sections that will not affect interrupt latency. To achieve this, the length of a critical section must be less than or equal to the length of the longest uninterruptible instruction of the CPU.
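To make the intent concrete, here is a rough sketch of the kind of critical section I have in mind (the cli/sti pair and the trivial body are only placeholders, and such code must of course run at CPL 0 or with a raised IOPL):

```c
#include <stdint.h>

/* Hypothetical tiny critical section: maskable interrupts are disabled only
 * for a few instructions.  The idea is that if this window is no longer than
 * the longest uninterruptible instruction the CPU can execute anyway, masking
 * interrupts here should not make the worst-case interrupt latency any worse. */
static inline void tiny_critical_section(volatile uint64_t *shared)
{
    __asm__ __volatile__("cli" ::: "memory");   /* mask maskable interrupts  */
    *shared += 1;                               /* placeholder critical work */
    __asm__ __volatile__("sti" ::: "memory");   /* unmask interrupts again   */
}
```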
Any kind of help is very welcome. If you know of any papers that could be helpful, please share links to them.
If Agner Fog's optimization manuals (supplemented with the Intel developer manuals) don't have anything, it's unlikely anyone/anything else will (save for some internal Intel/AMD data): http://www.agner.org/optimize/
In general, there is no guaranteed upper bound on interrupt latency. Consider the following example: a maskable hardware interrupt is signalled to the processor while the IF flag is cleared, so the interrupt remains pending until the processor executes the sti instruction, which sets the IF flag. Before that happens, however, the processor executes the hlt instruction and transitions to the C1 sleep state. In this case, the processor will not handle the interrupt until an unmaskable interrupt occurs to wake up the processor and the IF flag is set to enable handling maskable interrupts.
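To illustrate the difference (this is only a sketch, not code from any of the cited sources), the canonical idle sequence keeps sti and hlt back to back, so a pending maskable interrupt is delivered right after hlt wakes the core, whereas halting while the IF flag is still cleared leaves only non-maskable events able to wake it:

```c
/* Kernel-mode sketch of the two idle sequences discussed above.  With
 * "sti; hlt" back to back, the sti interrupt shadow guarantees that hlt is
 * reached before any pending maskable interrupt is delivered; that interrupt
 * then wakes the core from the C1 state and is handled immediately. */
static inline void idle_safe(void)
{
    __asm__ __volatile__("sti; hlt" ::: "memory");
}

/* Problematic case: hlt is executed while IF is still 0, so only an NMI,
 * SMI, INIT or RESET can wake the core, no matter how long that takes. */
static inline void idle_stuck_until_nmi(void)
{
    __asm__ __volatile__("hlt" ::: "memory");
}
```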
The interrupt latency for any interrupt (including unmaskable interrupts) can be on the order of hundreds of microseconds if all the processors that are supposed to handle the interrupt are in a very deep sleep state. On my Haswell processor, the wakeup latency of the C7 state is 133 us. If this is an issue for you, you can use the Linux kernel parameter intel_idle.max_cstate (in case the intel_idle driver is used, which is the default on Intel processors) or processor.max_cstate (for the acpi_idle driver) to limit the deepest C-state. You can tell the kernel to never put any core to sleep using idle=poll, which may minimize the interrupt latency on an idle core, assuming of course that the frequency is not reduced due to thermal throttling. Using a polling loop also reduces the maximum turbo frequency of all cores, which may reduce the overall performance of the system.
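If you want to see the wakeup latency that the idle driver advertises for each C-state on your own machine (assuming the Linux cpuidle sysfs interface is present), a small program along these lines can read it out; this is only a sketch:

```c
/* Sketch: print the name and advertised exit latency (in microseconds) of
 * each cpuidle state of CPU 0, as exposed by the Linux cpuidle driver under
 * /sys/devices/system/cpu/cpu0/cpuidle/stateN/{name,latency}. */
#include <stdio.h>

int main(void)
{
    for (int state = 0; ; state++) {
        char path[128], name[64];
        unsigned latency;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", state);
        if ((f = fopen(path, "r")) == NULL)
            break;                              /* no more states */
        if (fscanf(f, "%63s", name) != 1)
            name[0] = '\0';
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/latency", state);
        if ((f = fopen(path, "r")) == NULL)
            break;
        if (fscanf(f, "%u", &latency) != 1)
            latency = 0;
        fclose(f);

        printf("state%d %-10s exit latency: %u us\n", state, name, latency);
    }
    return 0;
}
```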
On an active core (in state C0), a hardware interrupt is only accepted when the core is in an interruptible state. Such a state occurs at instruction boundaries, except for string instructions, which are interruptible. Intel does not provide an upper bound on the number of instructions that are retired before a pending interrupt is accepted. A reasonable implementation may stop issuing uops into the ROB (at an instruction boundary) and wait until all uops in the ROB retire before beginning execution of the microcode routine that invokes the interrupt handler. In such an implementation, the interrupt latency depends on the time it takes to retire all of the pending uops. High-latency instructions such as loads, complex floating-point arithmetic, and locked instructions can easily push the interrupt latency into the range of hundreds of nanoseconds. However, if one of the pending uops requires a microcode assist for any reason (or for some specific reasons), the processor may choose to flush that instruction and all later instructions instead of invoking the assist. This implementation improves performance and power consumption at the cost of increased interrupt latency.
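To get a rough feel for how long a single high-latency instruction can keep the retirement window busy, you can time it with rdtsc and fences. The sketch below measures a 64-bit integer division (a locked read-modify-write to a cache line that misses in all caches would be a more extreme case); note that rdtsc counts reference cycles, which equal core cycles only at the nominal frequency:

```c
/* Sketch: roughly measure, in TSC cycles, how long one high-latency
 * instruction takes to complete.  The result includes the overhead of the
 * fences and rdtsc itself, so treat it as a rough figure for this one case,
 * not as an architectural worst case. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc, _mm_lfence */

int main(void)
{
    volatile uint64_t dividend = 0xFFFFFFFFFFFFFFFFull;
    volatile uint64_t divisor  = 3;
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < 1000; i++) {
        _mm_lfence();                        /* fence so rdtsc brackets the work */
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        uint64_t q = dividend / divisor;     /* the instruction being measured  */
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        _mm_lfence();
        if (t1 - t0 < best)
            best = t1 - t0;
        dividend = q | 1;                    /* keep the compiler honest */
    }
    printf("~%llu TSC cycles per division (including timing overhead)\n",
           (unsigned long long)best);
    return 0;
}
```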
In another possible implementation, tuned instead for minimizing interrupt latency, all in-flight instructions are immediately flushed without retiring anything. However, all of these flushed instructions have already gone through the pipeline (and some of them may have already completed), and they need to be fetched and sent through the pipeline again when the interrupt handler returns. This results in reduced performance and increased power consumption.
Hardware interrupts drain the store buffer and the write-combining buffers on Intel and AMD x86 processors. See: Interrupting an assembly instruction while it is operating.
A paper from Intel titled Reducing Interrupt Latency Through the Use of Message Signaled Interrupts discusses a methodology for measuring the latency of an interrupt from a PCIe device. This paper uses the term "interrupt latency" to mean the same thing as "interrupt response time" in the paper you mentioned. You need to somehow take a timestamp at the time the interrupt reaches the processor and another timestamp at the very beginning of the interrupt handler; an approximation of the interrupt latency can then be calculated by subtracting the two. The problem, of course, is obtaining the first timestamp (and in a way that is comparable to the second one). The Intel paper proposes using a PCIe analyzer, which consists of a PCIe device and an application that records, with timestamps, all PCIe traffic between the device and the CPU. They use a device driver that writes to an MMIO location mapped to the device from the interrupt handler to create the second timestamp.
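For the second timestamp, the interrupt handler of a Linux driver could do something along the following lines. This is only a sketch of the idea, not code from the Intel paper; my_dev, its mapped BAR regs, and MY_TIMESTAMP_REG are hypothetical names:

```c
/* Sketch: take a timestamp at the very start of the interrupt handler and
 * push it to the device over MMIO, so that a PCIe analyzer can correlate it
 * with the moment the MSI left the device.  Device names are made up. */
#include <linux/interrupt.h>
#include <linux/io.h>
#include <asm/msr.h>

#define MY_TIMESTAMP_REG 0x40           /* hypothetical device register offset */

struct my_dev {
    void __iomem *regs;                 /* BAR mapped earlier, e.g. with pci_iomap() */
};

static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct my_dev *dev = data;
    u64 tsc = rdtsc();                  /* take the timestamp as early as possible */

    /* The MMIO write travels back to the device as a PCIe posted write that
     * the analyzer records with its own timestamp. */
    writel((u32)tsc, dev->regs + MY_TIMESTAMP_REG);

    return IRQ_HANDLED;
}
```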