Implementing a well-performing TCP "to-send" queue
In order not to flood the remote endpoint my server app will have to implement a "to-send" queue of packets I wish to send.
I use Windows Winsock, I/O Completion Ports.
So, I know that when my code calls "socket->send(.....)" my custom "send()" function will check to see if data is already "on the wire" (towards that socket).
If data is indeed on the wire it will simply queue the data to be sent later.
If no data is on the wire it will call WSASend() to really send the data.
So far everything is nice.
Now, the size of the data I'm going to send is unpredictable, so I break it into smaller chunks (say 64 bytes) in order not to waste memory for small packets, and queue/send these small chunks.
When a "write-done" completion status is given by IOCP regarding the packet I've sent, I send the next packet in the queue.
That's the problem: the speed is awfully low.
I'm actually getting speeds of around 200 KB/s, and that's on a local connection (127.0.0.1).
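To make the current scheme concrete, here is a minimal sketch of the per-connection send path described above, assuming hypothetical names (Connection, SendBuffer, OnWriteCompleted) and a simple one-outstanding-write policy; it is only an illustration of the queue-while-busy / send-when-idle logic driven by IOCP completions, not the poster's actual code.

```cpp
// Hypothetical sketch of the per-connection send path; all names are illustrative.
#include <winsock2.h>
#include <deque>
#include <mutex>
#include <vector>

#pragma comment(lib, "ws2_32.lib")

struct SendBuffer
{
    OVERLAPPED        overlapped{};      // one OVERLAPPED per outstanding WSASend
    std::vector<char> data;              // the bytes to send
};

class Connection
{
public:
    explicit Connection(SOCKET s) : m_socket(s) {}

    // Called by the application; either sends immediately or queues.
    void Send(const char* bytes, size_t len)
    {
        std::lock_guard<std::mutex> lock(m_lock);

        auto buf = new SendBuffer;
        buf->data.assign(bytes, bytes + len);

        if (m_writePending)
            m_queue.push_back(buf);      // data already "on the wire" -> queue it
        else
            IssueSend(buf);              // nothing outstanding -> send right away
    }

    // Called from the IOCP worker thread once a write completion is dequeued
    // (in real code the SendBuffer would be recovered from the OVERLAPPED*
    // with CONTAINING_RECORD).
    void OnWriteCompleted(SendBuffer* completed)
    {
        delete completed;

        std::lock_guard<std::mutex> lock(m_lock);
        m_writePending = false;

        if (!m_queue.empty())
        {
            SendBuffer* next = m_queue.front();
            m_queue.pop_front();
            IssueSend(next);             // keep the pipe busy with the next chunk
        }
    }

private:
    void IssueSend(SendBuffer* buf)      // assumes m_lock is held
    {
        WSABUF wsaBuf;
        wsaBuf.len = static_cast<ULONG>(buf->data.size());
        wsaBuf.buf = buf->data.data();

        DWORD bytesSent = 0;
        m_writePending = true;

        if (WSASend(m_socket, &wsaBuf, 1, &bytesSent, 0,
                    &buf->overlapped, nullptr) == SOCKET_ERROR &&
            WSAGetLastError() != WSA_IO_PENDING)
        {
            m_writePending = false;      // real failure; handle/close elsewhere
            delete buf;
        }
    }

    SOCKET                  m_socket;
    std::mutex              m_lock;
    std::deque<SendBuffer*> m_queue;
    bool                    m_writePending = false;
};
```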
So, I know I'll have to call WSASend() with several chunks (an array of WSABUF objects), and that will give much better performance, but how much should I send at once?
Is there a recommended size of bytes? I'm sure the answer is specific to my needs, yet I'm also sure there is some "general" point to start with.
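For reference, a gathered send over several queued chunks might look like the sketch below; SendGathered, the chunk container and the 32-buffer cap are illustrative assumptions, and the chunks pointed at by the WSABUFs must stay alive (i.e. remain in the queue) until the completion for this write arrives.

```cpp
// Hypothetical sketch: draining several queued chunks with one gathered WSASend.
#include <winsock2.h>
#include <deque>
#include <vector>

// Build an array of WSABUFs from the first few queued chunks and send them
// in a single call instead of one WSASend per small chunk.
bool SendGathered(SOCKET s, std::deque<std::vector<char>>& queue,
                  OVERLAPPED& overlapped, size_t maxBuffers = 32)
{
    std::vector<WSABUF> bufs;
    bufs.reserve(maxBuffers);

    for (auto it = queue.begin();
         it != queue.end() && bufs.size() < maxBuffers; ++it)
    {
        WSABUF b;
        b.len = static_cast<ULONG>(it->size());
        b.buf = it->data();
        bufs.push_back(b);
    }

    DWORD bytesSent = 0;
    int rc = WSASend(s, bufs.data(), static_cast<DWORD>(bufs.size()),
                     &bytesSent, 0, &overlapped, nullptr);

    // Success means the write is either done or pending; the submitted chunks
    // may only be removed from the queue once the completion is processed.
    return rc == 0 || WSAGetLastError() == WSA_IO_PENDING;
}
```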
Is there any other, better, way to do this?
Of course, you only need to resort to providing your own queue if you are trying to send data faster than the peer can process it (either due to link speed or the speed at which the peer can read and process the data), and even then you only need your own data queue if you want to control the amount of system resources being used. If you only have a few connections then this is likely all unnecessary; if you have thousands then it's something you need to be concerned about. The main thing to realise here is that if you use ANY of the asynchronous network send APIs on Windows, managed or unmanaged, then you are handing control over the lifetime of your send buffers to the receiving application and the network. See here for more details.
And even once you have decided that you DO need to bother with this, you don't always need to bother: if the peer can process the data faster than you can produce it then there's no need to slow things down by queuing on the sender. You'll see that you need to queue data when your overlapped writes begin to take longer to complete, because the TCP stack cannot send any more data due to flow control (see http://www.tcpipguide.com/free/t_TCPWindowSizeAdjustmentandFlowControl.htm). At that point you are potentially using an unconstrained amount of limited system resources (both non-paged pool memory and the number of memory pages that can be locked are limited, and, as far as I know, both are used by pending socket writes)...
Anyway, enough of that... I assume you already have achieved good throughput before you added your send queue? To achieve maximum performance you probably need to set the TCP window size to something larger than the default (see http://msdn.microsoft.com/en-us/library/ms819736.aspx) and post multiple overlapped writes on the connection.
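As an illustration of the buffer-size part of that advice, something along the lines of the sketch below could be used; ConfigureBulkSocket and the 256 KB figure are assumptions for illustration, not values from the answer, and the actual size should be chosen by profiling.

```cpp
// Hypothetical sketch: enlarging the per-socket buffers (which influence the
// TCP window the stack can use) before doing bulk transfers.
#include <winsock2.h>

bool ConfigureBulkSocket(SOCKET s)
{
    int bufSize = 256 * 1024;   // assumption: tune by measurement

    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                   reinterpret_cast<const char*>(&bufSize), sizeof(bufSize)) != 0)
        return false;

    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                   reinterpret_cast<const char*>(&bufSize), sizeof(bufSize)) != 0)
        return false;

    return true;
}
```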
Assuming you already HAVE good throughput, you need to allow a number of pending overlapped writes before you start queuing; this maximises the amount of data that is ready to be sent. Once you have your magic number of pending writes outstanding you can start to queue the data and then send it based on subsequent completions. Of course, as soon as you have ANY data queued, all further data must be queued. Make the number configurable and profile to see what works best as a trade-off between speed and resources used (i.e. the number of concurrent connections that you can maintain).
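A minimal sketch of that "magic number" policy, assuming a hypothetical CanSendNow helper and an arbitrary default limit of 4 pending writes (a value to be profiled, not a recommendation from the answer):

```cpp
// Hypothetical sketch: only send immediately while the pending-write limit
// has not been reached and nothing is already queued.
#include <cstddef>

const size_t kMaxPendingWrites = 4;   // assumption: make configurable and profile

// 'pendingWrites' counts WSASend calls that have not yet completed;
// 'queuedBuffers' counts buffers waiting in the send queue.
inline bool CanSendNow(size_t pendingWrites, size_t queuedBuffers)
{
    // Once anything is queued, everything must go through the queue to
    // preserve ordering; otherwise send until the pending limit is reached.
    return queuedBuffers == 0 && pendingWrites < kMaxPendingWrites;
}
```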
I tend to queue the whole data buffer that is due to be sent as a single entry in a queue of data buffers. Since you're using IOCP, it's likely that these data buffers are already reference counted to make it easy to release them when the completions occur and not before, so the queuing process becomes simpler: you simply hold a reference to the send buffer whilst the data is in the queue and release it once you've issued a send.
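One possible shape for such a reference-counted buffer queue, using std::shared_ptr as a stand-in for whatever reference counting the IOCP framework already provides (all names here are assumptions):

```cpp
// Hypothetical sketch: reference-counted send buffers, so the same buffer can
// sit in the queue and later be owned by a pending WSASend without copies.
#include <deque>
#include <memory>
#include <vector>

struct RefCountedBuffer
{
    std::vector<char> data;
    // OVERLAPPED, WSABUF etc. would live here in a real IOCP buffer object.
};

using BufferPtr = std::shared_ptr<RefCountedBuffer>;

class SendQueue
{
public:
    void Enqueue(BufferPtr buf) { m_queue.push_back(std::move(buf)); }

    // Hand the next buffer to the send path; the queue drops its reference in
    // pop_front() and the pending send keeps the buffer alive via the returned
    // shared_ptr until its completion is processed.
    BufferPtr DequeueForSend()
    {
        if (m_queue.empty())
            return nullptr;
        BufferPtr buf = m_queue.front();
        m_queue.pop_front();
        return buf;
    }

private:
    std::deque<BufferPtr> m_queue;
};
```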
Personally I wouldn't optimise by using scatter/gather writes with multiple WSABUFs until you have the base working and you know that doing so actually improves performance; I doubt that it will if you already have enough data pending, but as always, measure and you will know.
64 bytes is too small.
You may have already seen this but I wrote about the subject here: http://www.lenholgate.com/blog/2008/03/bug-in-timer-queue-code.html though it's possibly too vague for you.