What happens after a packet is captured?
I've been reading about what happens after packets are captured by NICs, and the more I read, the more I'm confused.
Firstly, I've read that traditionally, after a packet is captured by the NIC, it gets copied to a block of memory in kernel space, then to user space for whatever application then works on the packet data. Then I read about DMA, where the NIC copies the packet directly into memory, bypassing the CPU. So is the NIC -> kernel memory -> user-space memory flow still valid? Also, do most NICs (e.g. Myricom's) use DMA to improve packet capture rates?
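To pin down the flow I'm asking about, here is a minimal sketch of the kind of capture loop I have in mind (my own sketch, Linux-specific, needs root): however the NIC lands the frame in kernel memory (usually DMA), the recvfrom() below is where the kernel-to-user copy happens.

```c
/* Minimal sketch (not from any particular driver): a classic Linux capture
 * loop over an AF_PACKET raw socket. The NIC (typically via DMA) lands
 * frames in kernel buffers; each recvfrom() below is the kernel-to-user
 * copy. Needs root (CAP_NET_RAW); error handling kept minimal. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];                     /* user-space buffer */
    for (int i = 0; i < 10; i++) {
        ssize_t n = recvfrom(fd, frame, sizeof frame, 0, NULL, NULL);
        if (n < 0) { perror("recvfrom"); break; }
        printf("captured %zd bytes\n", n);         /* data now in user space */
    }
    close(fd);
    return 0;
}
```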
Secondly, does RSS (Receive Side Scaling) work similarly in both Windows and Linux systems? I can only find detailed explanations of how RSS works in MSDN articles, where they talk about how RSS (and MSI-X) works on Windows Server 2008. But the same concepts of RSS and MSI-X should still apply to Linux systems, right?
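For what it's worth, my working mental model of RSS is just "hash the flow's 4-tuple, pick an RX queue, steer that queue's MSI-X interrupt to a CPU". The sketch below is a simplified stand-in for illustration only; real NICs use a Toeplitz hash with a per-device secret key plus an indirection table, and the queue count is hypothetical.

```c
/* Illustrative only: a stand-in for the RSS idea. Real hardware uses a
 * keyed Toeplitz hash and an indirection table, but the effect is the
 * same on both Windows and Linux -- every packet of one flow hashes
 * identically, so the whole flow stays on one RX queue (and one CPU). */
#include <stdint.h>
#include <stdio.h>

#define NUM_RX_QUEUES 4   /* hypothetical queue count */

/* Stand-in for the Toeplitz hash over (saddr, daddr, sport, dport). */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;                 /* mix the bits a little */
    h *= 0x45d9f3b;               /* arbitrary odd multiplier */
    h ^= h >> 16;
    return h;
}

int main(void)
{
    /* 10.0.0.1:12345 -> 10.0.0.2:80 always maps to the same queue. */
    uint32_t h = flow_hash(0x0a000001, 0x0a000002, 12345, 80);
    printf("flow -> RX queue %u\n", h % NUM_RX_QUEUES);
    return 0;
}
```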
Thank you.
Regards,
Rayne
Comments (2)
How this process plays out is mostly up to the driver author and the hardware, but for the drivers I've looked at or written and the hardware I've worked with, this is usually the way it works:
Zero-copy networking within the kernel isn't so bad. Zero-copy all the way down to userland is much harder. Userland gets data, but network packets are made up of both header and data. At the least, true zero-copy all the way to userland requires support from your NIC so that it can DMA packets into separate header/data buffers. The headers are recycled once the kernel routes the packet to its destination and verifies the checksum (for TCP, either in hardware if the NIC supports it or in software if not; note that if the kernel has to compute the checksum itself, it may as well copy the data too: looking at the data incurs cache misses anyway, and copying it elsewhere can be nearly free with tuned code).
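To make the header/data split concrete, here is a hypothetical RX descriptor layout. It is not any real device's format, just the shape of the idea: the driver posts two buffers per descriptor, and the hardware writes protocol headers into one and payload into the other, so the payload pages can later be handed toward user memory untouched.

```c
/* Hypothetical RX descriptor for a NIC with header/data split DMA
 * (illustrative layout, not a real device's). */
#include <stdint.h>

struct rx_desc {
    uint64_t hdr_buf_addr;   /* DMA address of a small header buffer    */
    uint64_t data_buf_addr;  /* DMA address of a page-sized data buffer */
    uint16_t hdr_len;        /* filled in by hardware on completion     */
    uint16_t data_len;       /* filled in by hardware on completion     */
    uint16_t status;         /* done bit, checksum-verified bit, etc.   */
    uint16_t reserved;
};
```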
Even assuming all the stars align, the data isn't actually in your user buffer when it is received by the system. Until an application asks for the data, the kernel doesn't know where it will end up. Consider the case of a multi-process daemon like Apache. There are many child processes, all listening on the same socket. You can also establish a connection, fork(), and have both processes able to recv() incoming data, as the sketch below shows.

TCP packets on the Internet usually carry 1460 bytes of payload (an MTU of 1500 = 20-byte IP header + 20-byte TCP header + 1460 bytes of data). 1460 is not a power of 2 and won't match the page size on any system you'll find. This presents problems for reassembly of the data stream. Remember that TCP is stream-oriented: there is no boundary between sender writes, and two 1000-byte writes waiting at the receiver will be consumed entirely by a single 2000-byte read.
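A small sketch of that ambiguity, using socketpair() as a stand-in for an accepted TCP connection so it stays self-contained: after fork(), either process may consume the bytes, and the two 1000-byte writes can be drained by a single 2000-byte read.

```c
/* Why the kernel can't know the destination buffer in advance: after
 * fork(), parent and child share the connected socket and race to recv()
 * the same stream. socketpair() stands in for an accepted TCP connection. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return 1;

    char chunk[1000];
    memset(chunk, 'x', sizeof chunk);
    write(sv[0], chunk, sizeof chunk);   /* two 1000-byte "sender writes" */
    write(sv[0], chunk, sizeof chunk);
    close(sv[0]);                        /* EOF so the loser doesn't block */

    if (fork() == 0) {                   /* child */
        char buf[2000];
        printf("child:  recv'd %zd bytes\n", recv(sv[1], buf, sizeof buf, 0));
        _exit(0);
    }
    char buf[2000];                      /* parent races for the same bytes */
    printf("parent: recv'd %zd bytes\n", recv(sv[1], buf, sizeof buf, 0));
    wait(NULL);
    return 0;
}
```

Run it a few times: which process gets the 2000 bytes and which gets 0 (EOF) is a scheduler race, which is exactly why the kernel can't pick a destination buffer at DMA time.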
Taking this further, consider the user buffers. These are allocated by the application. To be usable for zero-copy all the way down, a buffer needs to be page-aligned and must not share its memory page with anything else. At recv() time, the kernel could theoretically replace the old page with the one containing the data and "flip" it into place, but this is complicated by the reassembly issue above, since successive packets will land on separate pages. The kernel could limit the data it hands back to each packet's payload, but that would mean many more system calls, more page remapping, and likely lower overall throughput.

I'm really only scratching the surface on this topic. I worked at a couple of companies in the early 2000s trying to extend the zero-copy concepts down into userland. We even implemented a TCP stack in userland and circumvented the kernel entirely for applications using the stack, but that brought its own set of problems and was never production quality. It's a very hard problem to solve.
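For illustration, this is the kind of allocation such page flipping would require. Note that a plain recv() into it still copies on stock kernels; the alignment alone doesn't buy you zero-copy.

```c
/* A page-aligned, whole-page-sized user buffer -- the shape a
 * page-flipping recv() would need. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);   /* typically 4096 */
    void *buf = NULL;

    /* Aligned to a page boundary, sharing its pages with nothing else. */
    if (posix_memalign(&buf, page, 4 * page) != 0) return 1;

    printf("page size %ld, buffer %p, page-aligned: %s\n",
           page, buf, ((uintptr_t)buf % page == 0) ? "yes" : "no");
    free(buf);
    return 0;
}
```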
Take a look at this paper: http://www.ece.virginia.edu/cheetah/documents/papers/TCPlinux.pdf. It might help clear up some of the memory management questions.