What is the purpose of direct memory access for hard disks?
At first glance it seems like a good idea to let the hard disk write to RAM on its own, without CPU instructions copying data, particularly with the success of asynchronous networking in mind. But the Wikipedia article on Direct Memory Access (DMA) states this:
With DMA, the CPU gets freed from this overhead and can do useful tasks during data transfer (though the CPU bus would be partly blocked by DMA).
I don't understand how a bus line can be "partly blocked". Presumably memory can be accessed by only one device at a time, in which case there seems to be little useful work the CPU can actually do. It would be blocked on its first attempt to read uncached memory, which I expect would happen very quickly given a cache of only 2 MB.
The goal of freeing up the CPU to do other tasks therefore seems moot. Does hard disk DMA yield any performance increase in practice?
1: PIO (programmed IO) thrashes the CPU caches. The data read from the disk will, most of the time, not be processed immediately afterwards. Data is often read in large chunks by the application, but PIO is done in smaller blocks (typically 64K, IIRC). So the data-reading application will wait until the large chunk has been transferred, and will not benefit from the smaller blocks being in the cache just after they have been fetched from the controller. Meanwhile, other applications will suffer from large parts of the cache being evicted by the transfer. This could probably be avoided by using special instructions which tell the CPU not to cache the data but to write it "directly" to main memory; however, I'm pretty certain this would slow down the copy loop, and thereby hurt even more than the cache thrashing.
2: PIO, as it's implemented on x86 systems, and probably most other systems, is really slow compared to DMA. The problem is not that the CPU wouldn't be fast enough. The problem stems from the way the bus and the disk controller's PIO modes are designed. If I'm not mistaken, the CPU has to read every byte (or every DWORD when using 32 bit PIO modes) from a so-called IO port. That means for every DWORD of data, the port's address has to be put on the bus, and the controller must respond by putting the data DWORD on the bus. Whereas when using DMA, the controller can transfer bursts of data, utilizing the full bandwidth of the bus and/or memory controller. Of course there is much room for optimizing this legacy PIO design. DMA transfers are such an optimization. Other solutions that would still be considered PIO might be possible too, but then again they would still suffer from other problems (e.g. the cache thrashing mentioned above).
3: Memory and/or bus bandwidth is not the limiting factor for most applications, so a DMA transfer will not stall anything. It might slow some applications down a little, but usually it should be hardly noticeable. After all, disks are rather slow compared with the bandwidth of the bus and/or memory controller. A "disk" (SSD, RAID array) that delivers > 500 MB/s is really fast, and a bus or memory subsystem that cannot deliver at least 10 times that number must be from the stone age. PIO, on the other hand, really does stall the CPU completely while it's transferring a block of data.
I don't know if I'm missing anything.
Let's suppose we don't have a DMA controller. Every transfer from a "slow" device to memory would mean a loop for the CPU:
the CPU would have to write to memory itself, chunk by chunk.
Is it necessary to use the CPU for memory transfers? No. We can use another device (or a mechanism like DMA bus mastering) to transfer data to or from memory.
Meanwhile, the CPU can be doing something different, such as working out of its caches, and even accessing other parts of memory a good share of the time.
This is the crucial part: the data is not being transferred 100% of the time, because the other device is very slow (compared to the memory and the CPU).
An attempt to represent the shared memory bus usage over time (C when accessed by the CPU, D when accessed by the DMA controller):

CCCCCCCCCCCCDCCCCCCCCCCCCCCDCCCCCCCCCCCCDCCCCCCCC

As you can see, memory is accessed by one device at a time: sometimes the CPU, sometimes the DMA controller, and the DMA controller only a small fraction of the time.
Over a period of many clock cycles, some will be blocked and some will not. Quoting the University of Melbourne:
Even if the CPU is completely starved while a DMA block transfer is occurring, it will happen faster than if the CPU had to sit in a loop shifting bytes to/from an I/O device.
Disk controllers often have special block transfer instructions that enable fast data transfers. They may also transfer data in bursts, permitting interleaved CPU bus access. CPUs also tend to access memory in bursts, with the cache controller filling cache lines as they become available, so even though the CPU may occasionally be blocked, the end result is simply that cache throughput drops; the CPU doesn't actually stall.
One possible performance increase comes from the fact that a computer can have multiple DMA devices. With DMA, you can have multiple memory transfers occurring in parallel without the CPU having to perform all the overhead itself.
Processing doesn't happen on the CPU bus anyway. CPUs issue instructions that may or may not touch memory, and when they do, the accesses are typically resolved first against the L1 cache, then against L2 and L3, before memory is tried. Therefore, DMA transfers don't block processing.
Even when the CPU and the DMA transfer both need memory, they are not expected to access the same bytes. A memory controller may in fact be able to process both requests at the same time.
If you're using Linux, you can test this very easily by disabling DMA with hdparm. The effect is dramatic.
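A sketch of that experiment (requires root; `hdparm -d` only applies to IDE/PATA-era drives, since modern SATA/NVMe devices always use DMA, and `/dev/sda` is a placeholder for your actual device):

```shell
sudo hdparm -tT /dev/sda    # baseline: time buffered disk reads with DMA on
sudo hdparm -d0 /dev/sda    # disable DMA, falling back to PIO
sudo hdparm -tT /dev/sda    # re-run the timing: throughput drops sharply
sudo hdparm -d1 /dev/sda    # re-enable DMA when done
```

Besides the raw throughput numbers, watch CPU usage during the PIO run: the system becomes noticeably sluggish while the transfer is in progress.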