Problems between I/O-heavy operations and a network application listening for UDP and SCTP data
We have an application that uses two types of socket: a listening UDP socket and an active SCTP socket.
At certain times, scripts with high I/O activity (such as dd, tar, ...) run on the same machine. Most of the time, when these I/O-heavy scripts run, we see the following problems:
- The UDP socket closes
- The SCTP socket is still alive and we can see it in /proc/net/sctp/assocs, but no more traffic is received from this socket (until we restart the application)
Why do these I/O operations affect the network-based application in such a way?
Are there any kernel settings that would avoid these problems?
I would have expected some packets to be lost on the UDP socket and some retransmissions on the SCTP socket, but not this behavior.
The application runs on a server with four quad-core 64-bit CPUs and RHEL:
# uname -a
Linux server1 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
Comments (3)
When you say the UDP socket closes, what exactly do you mean? You call send and it fails?
For SCTP, can you collect wireshark or pcap traces while these I/O operations run (preferably by running wireshark on the peer)? My guess (an educated guess, without looking at the code) is that when these I/O operations come into the picture, your process gets starved of CPU time. The other end sends SCTP Heartbeat messages, to which it gets no replies. Or, if data was flowing, the peer is not receiving any SACKs, because the data has not yet been processed by the SCTP stack at your end. The peer therefore aborts the association internally and stops sending you data (since it sees all paths as down, it does not send an ABORT; in that case, your SCTP stack will still think the association is alive).
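To get a feel for how long your side would have to stay unresponsive before the peer gives up, here is a rough sketch of the timeout arithmetic. It assumes the backoff rule from RFC 4960 (the retransmission timeout doubles after each expiry, capped at RTO.Max) and ignores heartbeats and multi-homing. The function name total_backoff_ms is made up for this example, and the peer's real values may differ from the RFC's suggested defaults.

```c
#include <stdio.h>

/* Sum of `expiries` consecutive retransmission timeouts. The timeout starts
 * at rto_start_ms, doubles after each expiry, and is capped at rto_max_ms.
 * Hypothetical helper for estimation only, not part of any SCTP API. */
long total_backoff_ms(long rto_start_ms, long rto_max_ms, int expiries)
{
    long total = 0;
    long rto = rto_start_ms;
    int i;

    if (rto > rto_max_ms)
        rto = rto_max_ms;

    for (i = 0; i < expiries; i++) {
        total += rto;          /* wait for this timer to expire */
        rto *= 2;              /* exponential backoff */
        if (rto > rto_max_ms)
            rto = rto_max_ms;  /* never exceed RTO.Max */
    }
    return total;
}
```

With the RFC 4960 suggested values (RTO.Max of 60 s, Path.Max.Retrans of 5, so a path is marked inactive on the 6th consecutive expiry) and assuming the RTO has settled at RTO.Min of 1 s, total_backoff_ms(1000, 60000, 6) returns 63000, i.e. about a minute of silence from your side.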
Try to confirm the values of the heartbeat timeout, RTO timeout, SACK timeout, maximum path retransmissions, and maximum association retransmissions at the peer end. I haven't worked with kernel SCTP, but sysctl should be able to give you those values.
Either way, collecting pcap traces when you observe this problem would give us much better insight into what is going wrong. I hope it helps.
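As a minimal sketch of reading those values programmatically: Linux kernel SCTP exposes them as files under /proc/sys/net/sctp once the sctp module is loaded. The helper names read_long_file and print_sctp_tunables are made up for this example, and some entries (sack_timeout in particular) may be missing on an older kernel such as 2.6.18, so the code reports missing files instead of failing. It only shows the local machine's settings, so it would have to run on the peer if the peer also uses Linux kernel SCTP.

```c
#include <stdio.h>

/* Read one integer from a sysctl-style file (a single number followed by a
 * newline). Returns 0 on success, -1 if the file is missing or unparsable. */
int read_long_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the SCTP timers and retransmission limits found under
 * /proc/sys/net/sctp. Returns how many of them could be read. */
int print_sctp_tunables(FILE *dst)
{
    static const char *const names[] = {
        "hb_interval", "rto_initial", "rto_min", "rto_max",
        "sack_timeout", "path_max_retrans", "association_max_retrans",
    };
    char path[128];
    long value;
    int found = 0;
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        snprintf(path, sizeof path, "/proc/sys/net/sctp/%s", names[i]);
        if (read_long_file(path, &value) == 0) {
            fprintf(dst, "%-24s %ld\n", names[i], value);
            found++;
        } else {
            fprintf(dst, "%-24s (not available)\n", names[i]);
        }
    }
    return found;
}
```

Calling print_sctp_tunables(stdout) from a small main gives one line per parameter. The same values should also be visible with sysctl net.sctp on the command line.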
Here are some things I'd look into:
- What is the load on the UDP socket when the scripts are not running? Is it continuous or bursty?
- Does the socket ever close spontaneously when the scripts are not running?
- What happens to the data read off the socket? How much of the data coming off the socket (raw or processed) is written to disk?
- Can you monitor CPU, network, and disk I/O utilization to see whether any of them is saturating?
- Can the scripts running the I/O operations be run at a lower priority or, conversely, can the process that owns the UDP socket be run at a higher priority?
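For the priority question, one possible sketch uses the POSIX setpriority and getpriority calls to change the nice value of the calling process. The wrapper names set_own_nice and get_own_nice are made up for this example. The network application could ask for a negative nice value at startup (which normally needs root or CAP_SYS_NICE), while a launcher for the I/O scripts could raise its own nice value before exec'ing dd or tar.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Set the nice value of the calling process. Returns 0 on success, or -1
 * with errno set (EACCES on Linux when an unprivileged process asks for a
 * lower nice value, i.e. a higher priority). */
int set_own_nice(int nice_value)
{
    return setpriority(PRIO_PROCESS, 0, nice_value);
}

/* Read back the current nice value. getpriority() can legitimately return
 * -1, so errno is cleared first and checked afterwards. */
int get_own_nice(int *nice_value)
{
    int v;

    errno = 0;
    v = getpriority(PRIO_PROCESS, 0);
    if (v == -1 && errno != 0)
        return -1;
    *nice_value = v;
    return 0;
}
```

From the shell, running the scripts under nice (for example nice -n 19 followed by the command) has the same effect on them without touching the application.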
One thing a lot of people don't check is the return value of send, and they don't check for error conditions such as EINTR on recv. Maybe the heavy I/O load is causing some of your send or recv calls to get interrupted, and your app is treating those errors as hard errors and closing the socket without you realizing that they are transient.
I've seen this kind of thing happen, and you should definitely check for it by cranking up your log level and seeing whether your app is calling close unexpectedly.
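A minimal sketch of what such checking could look like, assuming the application uses plain BSD socket calls. The names is_transient_errno, recv_retry, and send_retry are made up for this example. The idea is to restart calls interrupted by a signal, log every other failure with its errno, and leave the decision to close the socket to the caller.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Errors that mean "try again", not "the socket is broken". */
int is_transient_errno(int err)
{
    return err == EINTR || err == EAGAIN || err == EWOULDBLOCK;
}

/* Log a failed call without disturbing errno for the caller. */
static void log_socket_error(const char *call, int fd)
{
    int saved = errno;

    fprintf(stderr, "%s on fd %d failed: %s (errno %d)\n",
            call, fd, strerror(saved), saved);
    errno = saved;
}

/* recv() that restarts when a signal interrupts the call. Other failures
 * are logged and returned, so the caller can consult is_transient_errno()
 * before deciding to close the socket. */
ssize_t recv_retry(int fd, void *buf, size_t len, int flags)
{
    ssize_t n;

    do {
        n = recv(fd, buf, len, flags);
    } while (n == -1 && errno == EINTR);

    if (n == -1)
        log_socket_error("recv", fd);
    return n;
}

/* send() with the same treatment, so a failed send is never ignored. */
ssize_t send_retry(int fd, const void *buf, size_t len, int flags)
{
    ssize_t n;

    do {
        n = send(fd, buf, len, flags);
    } while (n == -1 && errno == EINTR);

    if (n == -1)
        log_socket_error("send", fd);
    return n;
}
```

If the log shows EINTR or EAGAIN shortly before the socket is closed, the close is coming from the application's own error handling rather than from the kernel.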