Winsock tcp/ip socket listens but connection refused, race condition?
This involves two automated unit tests. Each one starts a TCP/IP server that creates a non-blocking socket, calls bind() and listen(), and then loops on select() waiting for a client that connects and downloads some data.
The catch is that they work perfectly when run separately, but when run as a test suite, the second test's client fails to connect with WSAECONNREFUSED...
UNLESS
there is a Thread.Sleep() of several seconds between them??!!!
Interestingly, there is a retry loop that attempts to connect again every 1 second after any failure. So the second test keeps looping until it times out after 10 minutes.
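For readers who want to reproduce the retry behaviour, here is a minimal sketch of a retry-every-second connect loop. It is not the poster's code: the function name connect_with_retry and its parameters are hypothetical, and it uses Python's socket module rather than Winsock directly. A refused connection surfaces as ConnectionRefusedError, which corresponds to WSAECONNREFUSED on Windows.

```python
import socket
import time


def connect_with_retry(host, port, retry_interval=1.0, deadline=600.0,
                       connect_timeout=5.0):
    """Keep trying to connect until it succeeds or `deadline` seconds pass.

    All names and defaults here are illustrative assumptions; the defaults
    mimic the 1 second retry and 10 minute timeout described in the post.
    """
    give_up_at = time.monotonic() + deadline
    while True:
        try:
            # Returns a connected socket (its timeout stays at connect_timeout).
            return socket.create_connection((host, port), timeout=connect_timeout)
        except ConnectionRefusedError:
            # The peer's TCP stack answered our SYN with an RST.
            if time.monotonic() >= give_up_at:
                raise
            time.sleep(retry_interval)
```

Calling connect_with_retry("127.0.0.1", port) against a port with no listener raises ConnectionRefusedError once the deadline has passed; against a listening port it returns on the first attempt.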
During that time, netstat -na shows the correct port number in the LISTEN state for the server socket. So if it is in the LISTEN state, why won't it accept the connection?
In the code, there are log messages that show select() NEVER even reports a socket as ready to read (which, for a listening socket, means ready to accept a connection).
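As a point of comparison, this is a minimal sketch of the pattern described above: a non-blocking listening socket polled with select(), where "readable" means accept() will not block. The helper name accept_one is hypothetical and the code is only an illustration in Python, not the server from the tests.

```python
import select
import socket


def accept_one(listener: socket.socket, timeout: float = 5.0) -> bytes:
    """Wait with select() for one client, accept it, and return what it sent.

    Hypothetical helper for illustration; `listener` must already be bound
    and listening.
    """
    listener.setblocking(False)
    # A listening socket shows up in the "readable" list when a connection
    # is waiting in the accept queue.
    readable, _, _ = select.select([listener], [], [], timeout)
    if not readable:
        raise TimeoutError("select() never reported the listener as readable")
    conn, _addr = listener.accept()
    with conn:
        conn.setblocking(True)  # keep the example simple: blocking read
        return conn.recv(1024)
```

If a client has connected and sent data, accept_one(listener) returns those bytes; if select() times out, as in the failing test, it raises TimeoutError.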
Obviously the problem must be related to some race condition between finishing one test (which means close() and shutdown() on each end of the socket) and the startup of the next.
This wouldn't be so bad if the retry logic allowed it to connect eventually after a couple of seconds. However it seems to get "gummed up" and won't even retry.
However, for some strange reason the listening socket SAYS it's in the LISTEN state even though it keeps refusing connections.
So that means it's the Windoze O/S which is actually catching the SYN packet and returning a RST packet (which means "Connection Refused").
The only other time I ever saw this error was when the code had a problem that caused hundreds of sockets to get stuck in TIME_WAIT state. But that's not the case here. netstat shows only about a dozen sockets with only 1 or 2 in TIME_WAIT at any given moment.
Please help.
Comments (3)
The fundamental problem was that, when closing the socket, a thread was trying to read any remaining bytes. This was done in a separate thread that held the read end of the socket open for a fixed number of milliseconds while repeatedly trying to read any data.
That logic has been replaced with code that reads any remaining data more intelligently and closes properly when the read returns 0, so the socket now closes much more rapidly.
So it turned out to be improper closing of the socket in my own code.
Thanks for all the help!
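For anyone hitting the same thing, the close sequence described above can be sketched roughly as follows. This is not the poster's actual fix: graceful_close and drain_timeout are hypothetical names, the example is in Python rather than Winsock C, and it assumes any leftover data can simply be discarded. The idea is to half-close the sending side, read until recv() returns 0 bytes, then close, instead of holding the socket open for a fixed time.

```python
import socket


def graceful_close(sock: socket.socket, drain_timeout: float = 2.0) -> int:
    """Half-close, drain until the peer closes, then close the socket.

    Returns the number of leftover bytes that were read and discarded.
    Hypothetical helper for illustration only.
    """
    discarded = 0
    sock.settimeout(drain_timeout)  # upper bound so a silent peer cannot hang us
    try:
        sock.shutdown(socket.SHUT_WR)  # tell the peer we are done sending
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # recv() returned 0 bytes: the peer has closed its side
                break
            discarded += len(chunk)
    except OSError:
        # Timeout or reset while draining: stop and close anyway.
        pass
    finally:
        sock.close()
    return discarded
```

The socket is released as soon as the peer finishes, and the timeout only matters when the peer never closes its end.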
I run lots of tests like this across build machines with various Windows operating systems (XP through Windows 7) with various numbers of cores and I've never seen it be a problem.
I don't believe that the listen socket transitioning to TIME_WAIT is likely to be your problem; I've certainly never seen it and I regularly run client server tests with the same port where I start and stop servers within the TIME_WAIT delay period. If you were starting your second server before your first had closed its socket (or if the socket were in TIME_WAIT) then I'd expect your second server to get an error when you attempted to bind().
Personally I think it's more likely that there's an issue in the code that you have that's accepting connections - that is, your test might have found a bug ;)
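The bind() expectation above is easy to check in isolation. The sketch below is illustrative only: second_bind_is_rejected is a hypothetical name, the code is Python rather than Winsock C, and neither socket sets SO_REUSEADDR (whose semantics differ between Windows and other platforms). It covers only the case where the first server still has its socket open, not the TIME_WAIT case.

```python
import socket


def second_bind_is_rejected(host: str = "127.0.0.1") -> bool:
    """Return True if a second bind() to an already-listening port fails."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except OSError:
            # Address already in use (WSAEADDRINUSE on Windows).
            return True
        return False
    finally:
        first.close()
        second.close()
```

If the second server in the test suite were starting too early, this is the kind of failure it would be expected to log, rather than ending up in LISTEN and refusing connections.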
Can we have a look at the code between your listen and the accept loop?
Do you have the problem if you reverse the order of the tests?
Are the client and server running on the same machine, does it change things if they aren't?
Etc.
I have some TCP test tools http://www.lenholgate.com/blog/2005/11/windows-tcpip-server-performance.html, if you set up your test system to run the test client from that link against an example server from this one http://www.lenholgate.com/blog/2005/11/simple-echo-servers.html do you still see your problem? (That is, run my server with my client in your test system so that it runs it the same as it runs your stuff and does my stuff work?).
From This MSDN site:
I think the minimum you can set the value to is 30 (try smaller but it might not work)
You can look at Winsock Programmer's FAQ for a more detailed explanation.