Linux loopback performance with TCP_NODELAY enabled
I recently stumbled on an interesting TCP performance issue while running some performance tests that compared network performance versus loopback performance. In my case the network performance exceeded the loopback performance (1Gig network, same subnet). In the case I am dealing with, latencies are crucial, so TCP_NODELAY is enabled. The best theory that we have come up with is that TCP congestion control is holding up packets. We did some packet analysis, and we can definitely see that packets are being held, but the reason is not obvious. Now the questions...
1) In what cases, and why, would communicating over loopback be slower than over the network?
2) When sending as fast as possible, why does toggling TCP_NODELAY have so much more of an impact on maximum throughput over loopback than over the network?
3) How can we detect and analyze TCP congestion control as a potential explanation for the poor performance?
4) Does anyone have any other theories as to the reason for this phenomenon? If yes, any method to prove the theory?
Here is some sample data generated by a simple point-to-point C++ app:
Transport     Message Size (bytes)  TCP NoDelay  Send Buffer (bytes)  Sender Host  Receiver Host  Throughput (bytes/sec)  Message Rate (msgs/sec)
TCP           128                   On           16777216             HostA        HostB          118085994               922546
TCP           128                   Off          16777216             HostA        HostB          118072006               922437
TCP           128                   On           4096                 HostA        HostB          11097417                86698
TCP           128                   Off          4096                 HostA        HostB          62441935                487827
TCP           128                   On           16777216             HostA        HostA          20606417                160987
TCP           128                   Off          16777216             HostA        HostA          239580949               1871726
TCP           128                   On           4096                 HostA        HostA          18053364                141041
TCP           128                   Off          4096                 HostA        HostA          214148304               1673033
UnixStream    128                   -            16777216             HostA        HostA          89215454                696995
UnixDatagram  128                   -            16777216             HostA        HostA          41275468                322464
NamedPipe     128                   -            -                    HostA        HostA          73488749                574130
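The benchmark source isn't shown here, but below is a minimal sketch of what such a sender might look like on Linux. The address, port, iteration count, and buffer size are placeholders picked to mirror one row of the table; this is not the original app.

// Hypothetical sender sketch (not the original benchmark): set SO_SNDBUF and
// TCP_NODELAY, connect, then stream fixed-size 128-byte messages as fast as
// possible. Address, port, and iteration count are illustrative placeholders.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int sndbuf = 16 * 1024 * 1024;   // matches the 16777216-byte rows
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

    int nodelay = 1;                 // 1 = TCP NoDelay "On", 0 = "Off"
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay));

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                     // placeholder port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr); // loopback case
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("connect"); return 1;
    }

    char msg[128];
    std::memset(msg, 'x', sizeof(msg));
    for (long i = 0; i < 10000000; ++i) {            // send as fast as possible
        if (send(fd, msg, sizeof(msg), 0) < 0) { perror("send"); break; }
    }
    close(fd);
    return 0;
}

The receiving side would simply accept the connection and recv() in a loop, counting bytes and messages per second.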
Here are a few more pieces of useful information:
- I only see this issue with small messages
- HostA and HostB both have the same hardware kit (Xeon [email protected], 32 cores total, 128 GB RAM, 1 Gig NICs)
- OS is RHEL 5.4, kernel 2.6.18-164.2.1.el5
Thank You
3 Answers
1) In what cases, and why, would communicating over loopback be slower than over the network?
Loopback puts the packet setup + TCP checksum calculation for both tx and rx on the same machine, so it needs to do twice as much processing, while with two machines you split the tx/rx work between them. This can have a negative impact on loopback.
2) When sending as fast as possible, why does toggling TCP_NODELAY have so much more of an impact on maximum throughput over loopback than over the network?
Not sure how you've come to this conclusion, but loopback and the network path are implemented very differently, and if you try to push them to the limit you will hit different issues. Loopback interfaces (as mentioned in the answer to 1) add tx+rx processing overhead on the same machine. NICs, on the other hand, have a number of limits on how many outstanding packets they can hold in their ring buffers, which leads to completely different bottlenecks (and this varies greatly from chip to chip, and even with the switch that sits between the hosts).
3) How can we detect and analyze TCP congestion control as a potential explanation for the poor performance?
Congestion control only kicks in if there is packet loss. Are you seeing packet loss? Otherwise you're probably hitting a limit imposed by the TCP window size versus the network latency.
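One way to check this from inside the application on Linux is to sample the kernel's per-connection counters with getsockopt(TCP_INFO). Here is a rough sketch, assuming fd is an already-connected TCP socket; growing retrans/lost counters or a collapsing congestion window would support the congestion-control theory.

// Sketch: dump Linux TCP_INFO counters for a connected socket. The fields
// shown (cwnd, ssthresh, unacked, retrans, lost, rtt, rto) are part of
// struct tcp_info in <netinet/tcp.h>; call this periodically from the sender.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <cstdio>

void dump_tcp_info(int fd) {
    tcp_info info;
    socklen_t len = sizeof(info);
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0) {
        std::perror("getsockopt(TCP_INFO)");
        return;
    }
    std::printf("cwnd=%u ssthresh=%u unacked=%u retrans=%u lost=%u "
                "rtt=%uus rto=%uus\n",
                info.tcpi_snd_cwnd, info.tcpi_snd_ssthresh, info.tcpi_unacked,
                info.tcpi_retrans, info.tcpi_lost, info.tcpi_rtt,
                info.tcpi_rto);
}

The same kind of information can usually be observed without touching the application via ss -i (iproute) or the protocol statistics from netstat -s.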
4) Does anyone have any other theories as to the reason for this phenomenon? If yes, any method to prove the theory?
I don't understand the phenomenon you refer to here. All I see in your table is that you have some sockets with a large send buffer - this can be perfectly legitimate. On a fast machine, your application will certainly be capable of generating more data than the network can pump out, so I'm not sure what you're classifying as a problem here.
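One thing worth verifying in that setup is what the kernel actually granted for the send buffer: Linux caps the value requested via SO_SNDBUF at net.core.wmem_max and then doubles it for bookkeeping overhead, so the effective buffer can differ from the 16777216 in the table. A small sketch, again assuming a connected socket fd:

// Sketch: read back the send-buffer size the kernel actually allocated.
// Linux caps the requested SO_SNDBUF at net.core.wmem_max and doubles it
// (to account for bookkeeping overhead), so the number returned here is
// what really bounds the amount of queued outgoing data.
#include <sys/socket.h>
#include <cstdio>

void print_effective_sndbuf(int fd) {
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0) {
        std::printf("effective SO_SNDBUF = %d bytes\n", sndbuf);
    } else {
        std::perror("getsockopt(SO_SNDBUF)");
    }
}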
One final note: small messages create a much bigger performance hit on your network for various reasons, such as:
1 or 2) I'm not sure why you're bothering to use loopback at all; I personally don't know how closely it will mimic a real interface or how valid the results will be. I know that Microsoft disables Nagle for the loopback interface (if you care). Take a look at this link; there's a discussion about it.
3) I would look closely at the first few packets in both cases and see whether you're getting a severe delay in the first five packets. See here.
This is the same issue I faced, too. When transferring 2 MB of data between two components running on the same RHEL6 machine, it took 7 seconds to complete. With larger data sizes the time is not acceptable: transferring 10 MB of data took 1 minute. Then I tried with TCP_NODELAY disabled, and it solved the problem. This does not happen when the two components are on two different machines.