Off-CPU memcpy?

Posted 2024-11-29 15:19:12

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.

This got me thinking, why doesn't intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.

Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.

Comments (3)

时光无声 2024-12-06 15:19:12

That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".

Some architectural and practical problems:

1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different than the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of OS time quanta and scheduling.

2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of asynch memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism and developers would simply fall back to the old-fashioned synchronous copy.

3) Cache locality has a first order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient since at least the destination buffer remains local to the core's L1. In the case of network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.

4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?

--- EDIT ---

5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.

日暮斜阳 2024-12-06 15:19:12

Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).

The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted by srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.

A better solution is some sort of zero-copy architecture, where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers to the network hardware. I've seen this done in embedded/real-time network stacks.

影子是时光的心 2024-12-06 15:19:12

Net Win?

It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.

Heavier User Context?

An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation, it must allow interrupts and automatically resume.

And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.

Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.

Or Distant Kernel Resource?

If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.

Idea! Have that super-fast CPU thing do it!

Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?
