vmsplice() and TCP
In the original vmsplice() implementation, it was suggested that if you had a user-land buffer twice the maximum number of pages that could fit in a pipe, a successful vmsplice() on the second half of the buffer would guarantee that the kernel was done using the first half of the buffer.
But that turned out not to be true. In particular for TCP, the kernel keeps the pages until it receives an ACK from the other side. Fixing this was left as future work, so for TCP the kernel would still have to copy the pages from the pipe.
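For reference, here is a minimal C sketch of that double-buffering scheme; the function name, the assumed 64 KiB pipe capacity, and the error handling are illustrative additions, not part of the original suggestion:

```c
/* Double-buffering over vmsplice()/splice(), as described above.
 * Assumes the default 64 KiB pipe capacity and a page-aligned user
 * buffer of twice that size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define PIPE_CAP (64 * 1024)            /* assumed pipe capacity */

/* buf must point to 2 * PIPE_CAP page-aligned bytes */
static int send_double_buffered(int sock, char *buf)
{
    int pipefd[2];
    int half = 0;

    if (pipe(pipefd) < 0)
        return -1;

    for (;;) {
        /* ... produce fresh data into buf + half * PIPE_CAP here ... */
        struct iovec iov = {
            .iov_base = buf + half * PIPE_CAP,
            .iov_len  = PIPE_CAP,
        };

        /* map the user pages into the pipe (no copy) */
        if (vmsplice(pipefd[1], &iov, 1, 0) < 0)
            return -1;

        /* move the pages from the pipe into the TCP socket */
        size_t left = PIPE_CAP;
        while (left > 0) {
            ssize_t n = splice(pipefd[0], NULL, sock, NULL, left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                return -1;
            left -= (size_t)n;
        }

        /* The assumption that fails for TCP: since the pipe only ever
         * holds one half, a successful vmsplice() of this half was taken
         * to mean the kernel was done with the other half.  TCP keeps
         * the pages until they are ACKed, so refilling the other half on
         * the next iteration can corrupt data still in flight. */
        half ^= 1;
    }
}
```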
vmsplice() has the SPLICE_F_GIFT option that sort of deals with this, but it exposes two other problems: how to efficiently get fresh pages from the kernel, and how to reduce cache thrashing. The first problem is that mmap requires the kernel to clear the pages; the second is that although mmap might use the fancy kscrubd feature in the kernel, that increases the working set of the process (cache thrashing).
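To make the trade-off concrete, here is a hedged sketch of the resulting mmap/vmsplice(SPLICE_F_GIFT)/splice/munmap cycle; the chunk size, the function name, and the memcpy() (standing in for generating data directly into the fresh pages) are assumptions for illustration:

```c
/* Gift freshly mmap()ed anonymous pages to the kernel for each chunk
 * and never touch them again, so page reuse is no longer an issue. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* assumed to fit within the pipe capacity */

static int send_gifted(int sock, int pipe_rd, int pipe_wr,
                       const char *src, size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = len - off < CHUNK ? len - off : CHUNK;

        /* fresh, zeroed pages from the kernel -- the clearing cost
         * mentioned above */
        void *page = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return -1;

        /* placeholder: a real producer would write into page directly */
        memcpy(page, src + off, n);

        /* gift the pages; the gift only takes effect when the address
         * and length are page aligned */
        struct iovec iov = { .iov_base = page, .iov_len = n };
        if (vmsplice(pipe_wr, &iov, 1, SPLICE_F_GIFT) < 0)
            return -1;

        size_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipe_rd, NULL, sock, NULL, left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                return -1;
            left -= (size_t)m;
        }

        /* drop our mapping; the gifted pages live on inside the kernel */
        munmap(page, CHUNK);
    }
    return 0;
}
```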
Based on this, I have these questions:
- What is the current state of notifying userland about the safe re-use of pages? I am especially interested in pages splice()d onto a TCP socket. Has anything happened in the last 5 years?
- Is mmap/vmsplice/splice/munmap the current best practice for zero-copy in a TCP server, or do we have better options today?
1 Comment
Yes, because the TCP socket holds on to the pages for an indeterminate time, you cannot use the double-buffering scheme mentioned in the example code. Also, in my use case the pages come from a circular buffer, so I cannot gift them to the kernel and allocate fresh pages. I can confirm that I am seeing data corruption in the received data.
I resorted to polling the level of the TCP socket's send queue until it drains to 0. That fixes the data corruption, but it is suboptimal because draining the send queue to 0 hurts throughput.
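For illustration, a minimal sketch of that polling workaround, assuming the SIOCOUTQ ioctl from <linux/sockios.h> (for TCP it reports the bytes still held in the send queue, which reaches 0 only once everything has been ACKed); the helper name and back-off interval are arbitrary:

```c
/* Poll the TCP send-queue level until it drains to 0, at which point
 * the pages spliced into the socket can safely be reused. */
#include <linux/sockios.h>   /* SIOCOUTQ */
#include <sys/ioctl.h>
#include <unistd.h>

static int wait_for_drain(int sock)
{
    int outq;

    do {
        if (ioctl(sock, SIOCOUTQ, &outq) < 0)
            return -1;
        if (outq > 0)
            usleep(1000);    /* arbitrary back-off; tune as needed */
    } while (outq > 0);

    return 0;                /* send queue empty: all data ACKed */
}
```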