How to find the cause of bad TCP connections
We're developing an online game where players communicate with the server using a persistent TCP connection. Persistent as in, its lifetime is that of a player's session, and if the connection is closed, the player is thrown from the game (though the client will attempt to automatically reconnect).
Problem
Now, of course everything works fine in our office (connecting to both testing and live servers), but our client reports that some players get disconnected a lot (every few seconds), and that they experience it themselves too (though their offices are in the same building).
Question
How can I find out the cause of these disconnects? Is it because:
- Players have bad internet connections and it can't be helped.
- The distance between players and server (Turkey <-> Netherlands) is too long.
- Something is wrong with the server (a CentOS machine) or the datacenter.
- The server is overloaded (though it happens under low loads too).
- There is an error in our software.
- Or some other reason?
The software is written in Java. It logs when players are disconnected, and if it actively kicks them (e.g. for not sending keep-alive messages) it logs that too.
Known data
- Whenever a spurious disconnect is reported and I check the logs, most of the time I don't see that player getting actively kicked by the server software; I only see that the connection has been closed.
- There is an internal monitoring service which has a bunch of localhost connections to the game server, the same way players do, and it doesn't get disconnected.
Others
There are many other online games like ours. How do they deal with this? (If the problem turns out to be in the server or datacenter, the solution is obvious.)
- Do they use UDP? I know action games do, for speed, but I presume TCP is normal for e.g. online poker and other slow games? (Not that that would help us, our client software is made in Flash, which doesn't support UDP)
- Is there some TCP tweaking that can be done to make it more lenient?
- Or do they get these disconnects as well, and just reconnect more transparently?
- Is there information about this on the web?
Comments (1)
I would ask players to allow you to enable "anonymous usage data", like many apps do, to periodically upload debugging information from their sessions back to you. This is how you figure out these sorts of situations.
From there, what you'll need when a disconnect happens is a pretty verbose log. When the disconnect happens, catch whatever exception was thrown (and don't forget to also log the cause via a call to .getCause(), making as many calls to .getCause() as necessary until you've logged all the way back to the root cause), as well as any relevant data you need to match up the client log with the server-side logs. Information you'll likely need includes session IDs, game IDs, timestamps, etc. Just think, "What information do I think I would need in order to troubleshoot this, assuming I had insight into both sides of the connection?" That is what you'll ultimately get by asking users to upload usage and debugging data.
From there you should be able to figure out at least a few situations where you have control over it, that is, where you can change your client/server code in order to alleviate some of the problems. In some cases, where the problem is either a client's configuration or faulty equipment (or maybe a piece of equipment in between that neither of you controls), you'll have to rely on robust re-connectivity.
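As a rough illustration of the logging advice above, here is a minimal Java sketch that walks the .getCause() chain down to the root cause and builds one log record carrying the identifiers needed to match client and server logs. The class name CauseChain, the method names, and the record layout are all hypothetical; they are not taken from the poster's code base, and a real server would hand the record to whatever logging framework it already uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: class, method names, and record layout are illustrative assumptions.
final class CauseChain {
    private CauseChain() {}

    /** Returns one line per throwable, from the caught exception down to the root cause. */
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        // The depth cap guards against a cyclic cause chain.
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            lines.add(depth + ": " + cur.getClass().getName() + ": " + cur.getMessage());
        }
        return lines;
    }

    /** Builds a single record that can be matched against the other side's log. */
    static String disconnectRecord(String sessionId, String gameId,
                                   long timestampMillis, Throwable t) {
        return "disconnect session=" + sessionId
                + " game=" + gameId
                + " ts=" + timestampMillis
                + " causes=" + describe(t);
    }
}
```

On the server this could be called from the catch block around the socket read, for example `log(CauseChain.disconnectRecord(sessionId, gameId, System.currentTimeMillis(), e))`, where `log`, `sessionId`, and `gameId` stand in for whatever the application already has.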
You'll never reduce disconnects to zero, but this information, after you see enough cases of it, should help you reduce the occurrence of disconnects to the situations that are outside of your control alone, at which point your power to shape the network will ultimately end, and you'll be as close to a "best case scenario" with network reliability as you can be.
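For the robust re-connectivity mentioned above, one common approach (not something the answer prescribes) is to retry with an exponentially growing, capped delay so that a flaky link does not produce a tight reconnect loop. The sketch below only computes the delay; the class name ReconnectBackoff and the example values are assumptions, and since the actual client is written in Flash, this Java version is meant to show the logic rather than be dropped in.

```java
// Hypothetical helper: the base delay, cap, and 0-based attempt numbering are illustrative assumptions.
final class ReconnectBackoff {
    private final long baseMillis;
    private final long maxMillis;

    ReconnectBackoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped at max. */
    long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Compare against max/2 first so the doubling can never overflow.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }
}
```

With `new ReconnectBackoff(500, 8000)`, the client would wait 500 ms before the first retry, doubling each time up to an 8-second cap. Adding a small random jitter to each delay is a common refinement when many clients may reconnect at once.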