Linux sockets: zero-copy locally, TCP/IP remotely

Posted on 2024-12-23 08:38:20

Networking is my worst area in operating systems, so forgive me for asking perhaps an incomplete question. I've been reading about this for a few hours, but it's kinda swimming in my head. (To me, I feel like chip design is easy compared to figuring out networking protocols.)

I have some networked services that communicate with each other via sockets. Specifically, the sockets are created with fd = socket(PF_INET, SOCK_STREAM, 0);, which automatically gets TCP/IP. I need this as the base case, because these services may be running on separate machines.

But for one project, we're trying to squeeze all of them into an underpowered embedded 'appliance', based on an Atom Z530P, so it seems to me that the memory copy overhead is something we could optimize out. I've been reading about that here: data-link-access-and-zero-copy and Linux_packet_mmap and packet_mmap.

For this case, one would create the socket something like this: fd = socket(PF_PACKET, SOCK_RAW, 0);. There's a bunch of other stuff to do as well, like allocating ring buffers, mmapping them, associating them with the socket, etc. It looks like you're restricted to using sendto and recvfrom in order to transmit data. As I understand it, since the socket is local, you don't need a reliable "stream" type socket, so raw sockets are the appropriate interface, and I'm guessing that the ring buffer is used at page granularity, where each packet (or datagram) starts at a page boundary.
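The setup steps mentioned above (sizing the ring, associating it with the socket, and mapping it) might look roughly like the following. This is only a sketch based on the packet_mmap documentation, using the default TPACKET_V1 layout. `struct rx_ring`, `fill_ring_request`, and `open_rx_ring` are hypothetical names, `ETH_P_ALL` is an assumed protocol choice, and opening the socket requires CAP_NET_RAW. In this layout, frames are fixed-size slots inside page-aligned blocks, so a frame starts on a page boundary only when the frame size is a multiple of the page size.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one receive ring. */
struct rx_ring {
    int fd;
    void *map;
    size_t len;
    struct tpacket_req req;
};

/* Fill a ring request and check the documented constraints:
 * blocks are whole pages, frames are TPACKET_ALIGNMENT-aligned slots
 * larger than the frame header, and frames never span blocks.
 * Returns 0 on success, -1 if the geometry is invalid. */
int fill_ring_request(struct tpacket_req *req, unsigned frame_size,
                      unsigned block_pages, unsigned block_nr)
{
    assert(req != NULL);
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || block_pages == 0 || block_nr == 0)
        return -1;
    if (frame_size % TPACKET_ALIGNMENT != 0 || frame_size <= TPACKET_HDRLEN)
        return -1;

    unsigned block_size = block_pages * (unsigned)page;
    if (frame_size > block_size)
        return -1;

    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = (block_size / frame_size) * block_nr;
    return 0;
}

/* Open a packet socket, attach the ring described by r->req, and map it.
 * Returns 0 on success, -1 on failure (for example, missing CAP_NET_RAW). */
int open_rx_ring(struct rx_ring *r)
{
    assert(r != NULL);
    r->map = MAP_FAILED;
    r->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (r->fd < 0)
        return -1;

    if (setsockopt(r->fd, SOL_PACKET, PACKET_RX_RING,
                   &r->req, sizeof r->req) < 0)
        goto fail;

    r->len = (size_t)r->req.tp_block_size * r->req.tp_block_nr;
    r->map = mmap(NULL, r->len, PROT_READ | PROT_WRITE, MAP_SHARED, r->fd, 0);
    if (r->map == MAP_FAILED)
        goto fail;
    return 0;

fail:
    close(r->fd);
    r->fd = -1;
    return -1;
}
```

A caller would run `fill_ring_request(&ring.req, 2048, 1, 4)` first and then `open_rx_ring(&ring)`. The frame size of 2048 and the block counts are illustrative values only.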

Before I spend a huge amount of time trying to investigate this further, I was hoping some helpful individuals might help me with some questions:

How much performance benefit should I expect to get here from zero-copy sockets? I think the last I checked, we were moving a maximum of about 40 MB/sec from one process to another and finally to the disk. In the most basic scenario, data moves from the capture process, to the one-to-many process (others can listen in on the stream), to the archiver process that writes to disk. That's two hops not counting the disk and internal stuff.
Does Linux do any of this automatically, optimizing for processes running on the same machine?
In any case, I would have listening sockets in TCP ports. Can I use those to make connections between processes yet still be able to use zero-copy? In other words, can I use AF_INET with PF_PACKET?
Is PF_PACKET with SOCK_RAW the only valid configuration for zero-copy sockets?
Is there any good sample code out there that will use zero-copy with TCP/IP as a fallback?
What's the simplest or best way to detect that the two processes are on the same machine? They know each other's IP addresses, so I could just compare and use different code paths for each. Is there a simpler way to do this?
Can I use write() and read() on a packet-based socket, or are those only valid for streams? (Rewriting how connections are made would be simpler than rewriting ALL of the socket code.)
Am I over-complicating things and/or optimizing the wrong thing? OProfile tells me that most CPU time is spent in two places: (1) zlib, and (2) the kernel, which I can't profile since I'm using CentOS 6.2, which doesn't provide a vmlinux. I assume the kernel time is a combination of idle time and data copying and not much else.
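For the "compare the IP addresses" idea in the same-machine question, one possible sketch is to compare the two ends of an already connected TCP socket. This is a heuristic, not a guarantee: it assumes IPv4 and assumes the kernel picks the destination address as the source address when the destination is local, which may not hold if the socket was explicitly bound to a different local address. `peer_is_same_host` is a hypothetical helper name.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 1 if the local and peer IPv4 addresses of a
 * connected socket are identical (suggesting both ends are on this host),
 * 0 if they differ, and -1 on error or if the socket is not connected. */
int peer_is_same_host(int fd)
{
    assert(fd >= 0);
    struct sockaddr_in local, peer;
    socklen_t llen = sizeof local;
    socklen_t plen = sizeof peer;

    if (getsockname(fd, (struct sockaddr *)&local, &llen) < 0)
        return -1;
    if (getpeername(fd, (struct sockaddr *)&peer, &plen) < 0)
        return -1;
    if (local.sin_family != AF_INET || peer.sin_family != AF_INET)
        return -1;

    return local.sin_addr.s_addr == peer.sin_addr.s_addr;
}
```

The result could then select between the local code path and the ordinary TCP path right after `connect()` or `accept()` returns.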

Thanks in advance for the help!
