Off-CPU memcpy?

Posted 2024-11-29 15:19:12

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.

This got me thinking, why doesn't intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.

Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.

Comments (3)

时光无声 2024-12-06 15:19:12

That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".

Some architectural and practical problems:

1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different than the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of OS time quanta and scheduling.

2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of asynch memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism and developers would simply fall back to the old-fashioned synchronous copy.

3) Cache locality has a first order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient since at least the destination buffer remains local to the core's L1. In the case of network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.

4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?

--- EDIT ---

5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.

日暮斜阳 2024-12-06 15:19:12

Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).

The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted by srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.

A better solution is some sort of zero-copy architecture, where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers to the network hardware. I've seen this done in embedded/real-time network stacks.

影子是时光的心 2024-12-06 15:19:12

Net Win?

It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.

Heavier User Context?

An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation, it must allow interrupts and automatically resume.

And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.

Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.

Or Distant Kernel Resource?

If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.

Idea! Have that super-fast CPU thing do it!

Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?
