select() hangs indefinitely
I have an application that runs on embedded Linux (older kernel, 2.6.18) and uses Live555. Occasionally, when the camera is heavily loaded, my RTSP server (built using Live555) hangs indefinitely. No amount of connecting or cajoling seems to get it to snap out of it, short of resetting the application.
I narrowed the hang down to this code:
static int blockUntilReadable(UsageEnvironment& env,
                              int socket, struct timeval* timeout) {
  int result = -1;
  do {
    fd_set rd_set;
    FD_ZERO(&rd_set);
    if (socket < 0) break;
    FD_SET((unsigned) socket, &rd_set);
    const unsigned numFds = socket+1;
    result = select(numFds, &rd_set, NULL, NULL, timeout);  // <-- HANG
timeout is, of course, a NULL pointer, which tells select() to block until the socket becomes readable. The problem is that it doesn't matter whether I connect to the RTSP server: the call simply blocks indefinitely.
I did a netstat -an, and it always outputs something like:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:5222            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:5800            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:5000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:5802            0.0.0.0:*               LISTEN
tcp       21      0 0.0.0.0:554             0.0.0.0:*               LISTEN
When it's in a failed state, I always see 21 on the Recv-Q, which is "The count of bytes not copied by the user program connected to this socket."
Does anyone have any idea what might be going south, or how I could troubleshoot this issue?
Comments (1)
That code looks pretty solid. I'm a little curious as to why you're casting to unsigned int, but it shouldn't hurt anything. Some thoughts:
It's not hanging where you think it is. Hopefully you've double/triple checked this. (Check it again?)
Your netstat interpretation is wrong. That part, as the man page notes, is for "Established" sockets - yours is a listener, which is the next sentence: "Listening: Since Kernel 2.6.18 this column contains the current syn backlog."
That looks like a huge backlog... which leads me to think you're not accept()-ing, perhaps because you're stuck in select(). That is the select() on your listening socket, right?
Last, double check that you're calling select() on the right socket, i.e., print out that socket arg and see if it is what it should be.
Essentially, verify that 1) it is hanging in select() and 2) the arguments to select() are correct. I suspect one of those two is not true.
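To check the second point, something along these lines could be called right before select() to log what the descriptor actually is. This is a rough sketch, not Live555 code: isListeningSocket, localPort and describeSocket are hypothetical names, and it assumes an IPv4 socket and that your platform supports reading SO_ACCEPTCONN via getsockopt().

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <cstdio>

// Hypothetical helper: 1 if fd is a listening socket, 0 if not, -1 on error
// (for example, fd is closed or is not a socket at all).
static int isListeningSocket(int fd) {
    int accepting = 0;
    socklen_t len = sizeof(accepting);
    if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &accepting, &len) < 0)
        return -1;
    return accepting ? 1 : 0;
}

// Hypothetical helper: local port in host byte order, or -1 on error.
// Assumes an IPv4 (AF_INET) socket.
static int localPort(int fd) {
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr*)&addr, &len) < 0) return -1;
    if (addr.sin_family != AF_INET) return -1;
    return ntohs(addr.sin_port);
}

// Log what we are about to pass to select().
static void describeSocket(int fd) {
    fprintf(stderr, "fd=%d listening=%d port=%d\n",
            fd, isListeningSocket(fd), localPort(fd));
}
```

If the descriptor being waited on were the RTSP listener, you would expect it to report as listening on port 554. Anything else (a different port, a non-listening socket, or an error) would suggest select() is being handed the wrong descriptor.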