How many memory copies are there between NIC packets and a user application on *nix systems?

Posted 2024-08-29 08:22:04

This is just a general question relating to some high-performance computing I've been wondering about. A certain low-latency messaging vendor speaks in its supporting documentation about using raw sockets to transfer the data directly from the network device to the user application and in so doing it speaks about reducing the messaging latency even further than it does anyway (in other admittedly carefully thought-out design decisions).

My question is therefore to those that grok the networking stacks on Unix or Unix-like systems. How much difference are they likely to be able to realise using this method? Feel free to answer in terms of memory copies, numbers of whales rescued or areas the size of Wales ;)

Their messaging is UDP-based, as I understand it, so there's no problem with establishing TCP connections etc. Any other points of interest on this topic would be gratefully thought about!

Best wishes,

Mike

Comments (2)

北恋 2024-09-05 08:22:04

There are some diagrams at http://vger.kernel.org/~davem/tcp_output.html (found by googling tcp_transmit_skb(), which is a key part of the TCP datapath). There is more interesting material on his site: http://vger.kernel.org/~davem/

In the user-to-TCP transmit part of the datapath there is 1 copy from user space into the skb, done by skb_copy_to_page (when sending via tcp_sendmsg()), and 0 copies with do_tcp_sendpages (called by tcp_sendpage()). The copy is needed to keep a backup of the data in case a segment goes undelivered. skb buffers in the kernel can be cloned, but their data stays in the first (original) skb. sendpage can take a page from another part of the kernel and keep it as the backup (I think there is something COW-like there).
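
As a rough user-space illustration of that difference (my own sketch, not from the question's vendor; descriptors and buffer sizes are placeholders): send() feeds user memory into tcp_sendmsg(), which has to copy it into skbs, while sendfile() ends up in tcp_sendpage() and can hand page-cache pages to the socket without that extra copy.

    /* Hypothetical sketch: two ways to push a file's bytes into a connected
     * TCP socket.  send() copies the user buffer into kernel skbs
     * (sock_sendmsg -> tcp_sendmsg path); sendfile() lets the kernel use
     * page-cache pages directly (do_sendfile -> tcp_sendpage path). */
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* one extra copy: page cache -> user buffer -> kernel skb */
    static ssize_t send_with_copy(int sock, int file_fd)
    {
        char buf[4096];
        ssize_t n = read(file_fd, buf, sizeof(buf));   /* kernel -> user */
        if (n <= 0)
            return n;
        return send(sock, buf, (size_t)n, 0);          /* user -> kernel skb */
    }

    /* no user-space copy of the payload at all */
    static ssize_t send_without_copy(int sock, int file_fd, size_t count)
    {
        off_t off = 0;
        return sendfile(sock, file_fd, &off, count);   /* page cache -> socket */
    }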

Call paths (traced manually via LXR). Sending goes through tcp_push_one/__tcp_push_pending_frames:

tcp_sendmsg() <-  sock_sendmsg <- sock_readv_writev <- sock_writev <- do_readv_writev

tcp_sendpage() <- file_send_actor <- do_sendfile 

Receiving goes through tcp_recv_skb():

tcp_recvmsg() <-  sock_recvmsg <- sock_readv_writev <- sock_readv <- do_readv_writev

tcp_read_sock() <- ... the splice read path on newer kernels, a sendfile-like path on older ones

On receive there can be 1 copy from kernel to user, done by skb_copy_datagram_iovec (called from tcp_recvmsg). For tcp_read_sock() there can also be a copy: it invokes the sk_read_actor callback, and if the destination is a file or memory it may (or may not) copy the data out of the DMA zone; if the destination is another network socket, it already has the skb of the received packet and can reuse its data in place.
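
The newer-kernel splice path mentioned above can be exercised from user space roughly like this (my own sketch, assuming the socket and output descriptors are already set up): the socket's data is moved into a pipe and on to another descriptor without ever landing in a user buffer.

    /* Sketch: drain a TCP socket into out_fd via a pipe, using splice() so the
     * payload is not copied through a user-space buffer
     * (tcp_read_sock / splice receive path). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int splice_socket_to_fd(int sock, int out_fd, size_t chunk)
    {
        int pipefd[2];
        if (pipe(pipefd) < 0)
            return -1;

        for (;;) {
            ssize_t n = splice(sock, NULL, pipefd[1], NULL, chunk,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                break;                                  /* EOF or error */
            /* pipe -> out_fd, still without touching user memory */
            splice(pipefd[0], NULL, out_fd, NULL, (size_t)n,
                   SPLICE_F_MOVE | SPLICE_F_MORE);
        }
        close(pipefd[0]);
        close(pipefd[1]);
        return 0;
    }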

For UDP: receive = 1 copy -- skb_copy_datagram_iovec called from udp_recvmsg; transmit = 1 copy -- udp_sendmsg -> ip_append_data -> getfrag (which seems to be ip_generic_getfrag, with 1 copy from user space, though there may be a sendpage/splice-like variant without page copying).
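
To make the receive-side count concrete, a minimal sketch (mine; socket creation and binding omitted and assumed done): recvmsg() hands the kernel an iovec, and udp_recvmsg() fills it through skb_copy_datagram_iovec, one copy from the skb into user memory.

    /* Sketch: one recvmsg() on an already-bound UDP socket.  The kernel copies
     * the datagram payload from the skb into this iovec exactly once
     * (udp_recvmsg -> skb_copy_datagram_iovec). */
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static ssize_t recv_one_datagram(int udp_sock, char *buf, size_t buflen)
    {
        struct sockaddr_in src;
        struct iovec iov = { .iov_base = buf, .iov_len = buflen };
        struct msghdr msg;

        memset(&msg, 0, sizeof(msg));
        msg.msg_name = &src;
        msg.msg_namelen = sizeof(src);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        return recvmsg(udp_sock, &msg, 0);              /* skb -> buf: the 1 copy */
    }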

Generally speaking, there must be at least 1 copy when sending from or receiving into user space, and 0 copies when using zero-copy (surprise!) with kernel-space source/target buffers for the data. All headers are added without moving the packet, and a DMA-capable (i.e. any modern) network card will fetch the data from anywhere in DMA-able address space. For ancient cards PIO is needed, so there is one more copy, from kernel space into the PCI/ISA/whatever I/O registers or memory.

UPD: on the path from the NIC to the TCP stack (this is NIC-dependent; I checked 8139too) there is one more copy, from the rx_ring into an skb, and the same on the transmit side, from the skb into the tx buffer: +1 copy. You also have to fill in the IP and TCP headers, but does the skb already contain them, or space reserved for them?

白日梦 2024-09-05 08:22:04

To reduce latency in high-performance setups you should avoid going through a kernel driver. The smallest latency is achieved with user-space drivers (MX does this; InfiniBand probably does too).

There is a rather good (but slightly outdated) overview of Linux networking internals, "A Map of the Networking Code in Linux Kernel 2.4.20", which includes diagrams of the TCP/UDP datapaths.

Using raw sockets will make the path of TCP packets a bit shorter (thanks for the idea): the TCP code in the kernel no longer adds its latency, but the user has to handle the whole TCP protocol themselves, which gives some chance of optimizing it for specific situations. Cluster code does not need to handle long-distance or slow links the way the default TCP/UDP stack does.
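
To make that trade-off concrete, here is a rough sketch (mine, not the vendor's code; the interface name is a placeholder and CAP_NET_RAW is assumed) of receiving frames on a Linux AF_PACKET raw socket: the frames arrive with Ethernet/IP/UDP headers still attached, and validating and stripping them is now the application's job.

    /* Sketch: receive one raw Ethernet frame from an interface and peel off
     * the headers by hand -- the work the kernel's UDP/TCP code would
     * otherwise do.  Requires CAP_NET_RAW; ifname is a placeholder. */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <netinet/ip.h>
    #include <netinet/udp.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int recv_one_raw_udp(const char *ifname)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        if (fd < 0)
            return -1;

        struct sockaddr_ll addr;
        memset(&addr, 0, sizeof(addr));
        addr.sll_family = AF_PACKET;
        addr.sll_protocol = htons(ETH_P_IP);
        addr.sll_ifindex = (int)if_nametoindex(ifname);
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            return -1;
        }

        unsigned char frame[2048];
        ssize_t n = recv(fd, frame, sizeof(frame), 0);  /* whole frame, headers and all */
        close(fd);
        if (n < (ssize_t)(sizeof(struct ethhdr) + sizeof(struct iphdr)))
            return -1;

        /* the application now owns header handling: */
        struct iphdr *ip = (struct iphdr *)(frame + sizeof(struct ethhdr));
        if (ip->protocol != IPPROTO_UDP)
            return 0;                                   /* filter in user space */
        struct udphdr *udp = (struct udphdr *)((unsigned char *)ip + ip->ihl * 4);
        /* UDP payload begins at (unsigned char *)udp + sizeof(*udp) */
        (void)udp;
        return 0;
    }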

I'm very interested in this topic too.
