Using a "to-be-sent" queue to limit TCP sends, and other design issues
This question is the result of two other questions I've asked in the last few days.
I'm creating a new question because I think it's related to the "next step" in my understanding of how to control the flow of my send/receive, something I didn't get a full answer to yet.
The other related questions are:
An IOCP documentation interpretation question - buffer ownership ambiguity
Non-blocking TCP buffer issues
In summary, I'm using Windows I/O Completion Ports.
I have several threads that process notifications from the completion port.
I believe the question is platform-independent and would have the same answer as if I were doing the same thing on a *nix, *BSD, or Solaris system.
So, I need to have my own flow control system. Fine.
So I send and send and send, a lot. How do I know when to start queueing the sends, since the receiver side is limited to X amount?
Let's take an example (closest thing to my question): FTP protocol.
I have two servers; One is on a 100Mb link and the other is on a 10Mb link.
I tell the 100Mb one to send a 1GB file to the other one (the 10Mb-linked one). It finishes with an average transfer rate of 1.25MB/s, i.e. the full capacity of the 10Mb link (10Mb/s ÷ 8 = 1.25MB/s).
How did the sender (the 100Mb-linked one) know when to hold back its sends, so the slower one wouldn't be flooded? (In this case the "to-be-sent" queue is the actual file on the hard disk.)
Another way to ask this:
Can I get a "hold-your-sendings" notification from the remote side? Is it built into TCP, or does the so-called "reliable network protocol" need me to do it myself?
I could of course limit my sends to a fixed number of bytes, but that simply doesn't sound right to me.
Again, I have a loop with many sends to a remote server, and at some point within that loop I'll have to decide whether I should queue that send or whether I can pass it on to the transport layer (TCP).
How do I do that? What would you do? Of course, when I get a completion notification from IOCP that a send has finished, I'll issue the other pending sends; that part is clear.
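To make the loop concrete, here is roughly its shape (all the names below are placeholders I made up for illustration; the open question is how the decision should be made):

```cpp
// Rough shape of my send loop; CanPostNow / PostAsyncSend / QueueForLater are
// placeholder names, not real code. The open question is CanPostNow().
#include <vector>

struct Chunk { std::vector<char> bytes; };
struct Connection;                              // my per-socket state

bool CanPostNow(Connection&);                   // the decision I'm asking about
void PostAsyncSend(Connection&, const Chunk&);  // hands the data to WSASend
void QueueForLater(Connection&, const Chunk&);  // my own "to-be-sent" queue

void SendAll(Connection& conn, const std::vector<Chunk>& chunks) {
    for (const Chunk& c : chunks) {
        if (CanPostNow(conn))
            PostAsyncSend(conn, c);
        else
            QueueForLater(conn, c);
    }
}
```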
Another design question related to this:
Since I'm going to use custom buffers with a send queue, and these buffers are freed for reuse (rather than destroyed with the "delete" keyword) when a "send-done" notification arrives, I'll have to use mutual exclusion on that buffer pool.
Using a mutex slows things down, so I've been thinking: why not have each thread keep its own buffer pool? Then accessing it, at least when getting the buffers required for a send operation, would need no mutex, because the pool belongs to that thread only.
The buffer pool would live in thread-local storage (TLS).
No shared pool implies no lock needed, which implies faster operations, BUT it also implies more memory used by the app, because even if one thread has already allocated 1000 buffers, another thread that is sending right now and needs 1000 buffers to send something will have to allocate 1000 of its own.
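For illustration, this is roughly what I mean by the two options: a shared pool guarded by a mutex versus a per-thread pool reached through thread-local storage (all names are mine, just a sketch):

```cpp
// Sketch of the two buffer-pool options being weighed. Illustrative names only.
#include <mutex>
#include <vector>

struct Buffer { char data[4096]; };

// Option 1: one shared pool; every acquire/release takes the mutex.
class SharedPool {
public:
    Buffer* Acquire() {
        std::lock_guard<std::mutex> lk(mu_);
        if (free_.empty()) return new Buffer;
        Buffer* b = free_.back();
        free_.pop_back();
        return b;
    }
    void Release(Buffer* b) {
        std::lock_guard<std::mutex> lk(mu_);
        free_.push_back(b);            // recycled, not deleted
    }
private:
    std::mutex mu_;
    std::vector<Buffer*> free_;
};

// Option 2: a per-thread pool via thread-local storage, no lock on acquire.
// Drawback: each sending thread grows its own pool, so total memory is higher,
// and a buffer released on a different thread lands in *that* thread's pool.
class ThreadLocalPool {
public:
    static Buffer* Acquire() {
        std::vector<Buffer*>& pool = Pool();
        if (pool.empty()) return new Buffer;
        Buffer* b = pool.back();
        pool.pop_back();
        return b;
    }
    static void Release(Buffer* b) { Pool().push_back(b); }
private:
    static std::vector<Buffer*>& Pool() {
        thread_local std::vector<Buffer*> pool;
        return pool;
    }
};
```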
Another issue:
Say I have buffers A, B, C in the "to-be-sent" queue.
Then I get a completion notification that tells me the receiver got 10 out of 15 bytes. Should I re-send from the corresponding offset into the buffer, or will TCP handle it for me, i.e. complete the send? And if I should, can I be sure that this buffer is the "next-to-be-sent" one in the queue, or could it be buffer B, for example?
This is a long question and I hope none got hurt (:
I'd loveeee to see someone take the time to answer here. I promise I'll double-vote for him! (:
Thank you all!
3 Answers
Firstly: I'd ask this as separate questions. You're more likely to get answers that way.
I've spoken about most of this on my blog: http://www.lenholgate.com but then since you've already emailed me to say that you read my blog you know that...
The TCP flow control issue arises because you are posting asynchronous writes, and these each use resources until they complete (see here). During the time that a write is pending there are various resource usage issues to be aware of, and the use of your data buffer is the least important of them; you'll also use up some non-paged pool, which is a finite resource (though there is much more available in Vista and later than in previous operating systems), and you'll also be locking pages in memory for the duration of the write, and there's a limit to the total number of pages that the OS can lock. Note that neither the non-paged pool usage nor the page locking issues are documented very well anywhere, but you'll start seeing writes fail with ENOBUFS once you hit them.
Due to these issues it's not wise to have an uncontrolled number of writes pending. If you are sending a large amount of data and you have no application-level flow control, then you need to be aware that if you send data faster than it can be processed by the other end of the connection, or faster than the link speed, then you will begin to use up lots and lots of the above resources as your writes take longer to complete due to TCP flow control and windowing issues. You don't get these problems with blocking socket code, as the write calls simply block when the TCP stack can't write any more due to flow control issues; with async writes the write call returns immediately and the write is then pending. With blocking code the blocking deals with your flow control for you; with async writes you could continue to loop and generate more and more data, all of which is just waiting to be sent by the TCP stack...
Anyway, because of this, with async I/O on Windows you should ALWAYS have some form of explicit flow control. So, you either add application-level flow control to your protocol, using an ACK perhaps, so that you know when the data has reached the other side and only allow a certain amount to be outstanding at any one time, OR, if you can't add to the application-level protocol, you can drive things by using your write completions. The trick is to allow a certain number of outstanding writes per connection and to queue the data (or just not generate it) once you have reached your limit. Then as each write completes you can generate a new write....
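A minimal sketch of that completion-driven approach, assuming a hypothetical Connection class with a per-connection cap (the class, the names, and the limit are illustrative, not taken from any real library):

```cpp
// Sketch only: per-connection cap on outstanding WSASend calls, with a queue
// for data that exceeds the cap. Connection, kMaxPendingWrites, PostWrite and
// OnWriteCompleted are illustrative names.
#include <winsock2.h>
#include <deque>
#include <mutex>
#include <vector>

struct SendOp {
    WSAOVERLAPPED     ov{};           // one OVERLAPPED per outstanding write
    WSABUF            wsabuf{};
    std::vector<char> data;           // keeps the buffer alive until completion
};

class Connection {
public:
    explicit Connection(SOCKET s) : sock_(s) {}

    // Called by the producer: either post the write now or queue it.
    void Send(std::vector<char> bytes) {
        std::lock_guard<std::mutex> lk(mu_);
        if (pendingWrites_ < kMaxPendingWrites)
            PostWrite(std::move(bytes));
        else
            queued_.push_back(std::move(bytes));   // application-level flow control
    }

    // Called from the IOCP thread when a write completion is dequeued.
    void OnWriteCompleted(SendOp* op) {
        delete op;
        std::lock_guard<std::mutex> lk(mu_);
        --pendingWrites_;
        if (!queued_.empty()) {                    // drain the queue as writes complete
            PostWrite(std::move(queued_.front()));
            queued_.pop_front();
        }
    }

private:
    void PostWrite(std::vector<char> bytes) {      // mu_ must already be held
        auto* op = new SendOp;
        op->data = std::move(bytes);
        op->wsabuf.buf = op->data.data();
        op->wsabuf.len = static_cast<ULONG>(op->data.size());
        DWORD sent = 0;
        if (WSASend(sock_, &op->wsabuf, 1, &sent, 0, &op->ov, nullptr) == 0 ||
            WSAGetLastError() == WSA_IO_PENDING) {
            ++pendingWrites_;                      // completion will arrive via IOCP
        } else {
            delete op;                             // real code: abort the connection
        }
    }

    static const size_t kMaxPendingWrites = 16;    // tune for your hardware
    SOCKET sock_;
    std::mutex mu_;
    size_t pendingWrites_ = 0;
    std::deque<std::vector<char>> queued_;
};
```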
Your question about pooling the data buffers is, IMHO, premature optimisation on your part right now. Get to the point where your system is working properly, and once you have profiled your system and found that contention on your buffer pool is the most important hot spot, THEN address it. I found that per-thread buffer pools didn't work so well, as the distribution of allocations and frees across threads tends not to be as balanced as you'd need for that to work. I've spoken about this more on my blog: http://www.lenholgate.com/blog/2010/05/performance-comparisons-for-recent-code-changes.html
Your question about partial write completions (you send 100 bytes and the completion comes back and says that you have only sent 95) isn't really a problem in practice, IMHO. If you get to this position and have more than one outstanding write then there's nothing you can do; the subsequent writes may well work and you'll have bytes missing from what you expected to send. BUT a) I've never seen this happen unless you have already hit the resource problems that I detail above, and b) there's nothing you can do if you have already posted more writes on that connection, so simply abort the connection. Note that this is why I always profile my networking systems on the hardware that they will run on, and I tend to place limits in MY code to prevent the OS resource limits ever being reached (bad drivers on pre-Vista operating systems would often blue-screen the box if they couldn't get non-paged pool, so you could bring a box down if you didn't pay careful attention to these details).
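A small sketch of that check (HandleSendCompletion and AbortConnection are illustrative names, not a real API; SendOp mirrors the per-write state in the earlier sketch): if the completion reports fewer bytes than were requested, drop the connection rather than trying to patch the stream.

```cpp
// Sketch: treat a short write completion as fatal for the connection.
#include <winsock2.h>

struct SendOp { WSAOVERLAPPED ov; WSABUF wsabuf; /* plus the buffer storage */ };

void AbortConnection(SOCKET s);   // e.g. cancel outstanding I/O, then closesocket()

void HandleSendCompletion(SOCKET s, SendOp* op, DWORD bytesTransferred) {
    if (bytesTransferred != op->wsabuf.len) {
        // A later write may already be queued behind this one, so the byte
        // stream is broken; the only safe recovery is to drop the link.
        AbortConnection(s);
    }
    delete op;
}
```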
Separate questions next time, please.
Q1. Most APIs will give you a "write is possible" event after your last write, once writing becomes available again (this can happen immediately if you failed to fill the major part of the send buffer with the last send).
With a completion port it arrives just like the "new data" event does. Think of new data as a "read OK" event; there is likewise a "write OK" event. The names differ between the APIs.
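A rough sketch of what that dispatch can look like with GetQueuedCompletionStatus, assuming each OVERLAPPED is embedded in a per-operation struct that records whether it was posted as a read or a write (PerIo and OpType are made-up names):

```cpp
// Sketch of an IOCP worker loop that tells "read OK" and "write OK" apart by
// tagging each OVERLAPPED. PerIo/OpType are illustrative names; the PerIo is
// whatever you allocate per WSARecv/WSASend call.
#include <winsock2.h>
#include <windows.h>

enum class OpType { Read, Write };

struct PerIo {
    OVERLAPPED ov{};     // first member, so the OVERLAPPED* can be cast back
    OpType     type{};
};

void WorkerLoop(HANDLE iocp) {
    for (;;) {
        DWORD       bytes = 0;
        ULONG_PTR   key   = 0;
        OVERLAPPED* ov    = nullptr;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (!ov) break;                           // queue closed or fatal error
        PerIo* io = reinterpret_cast<PerIo*>(ov); // ov is the first member
        if (io->type == OpType::Write) {
            // "write OK": a previous WSASend finished, safe to post the next one
        } else {
            // "read OK": new data arrived from a previous WSARecv
        }
        (void)ok; (void)bytes; (void)key;
    }
}
```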
Q2. If a kernel-mode transition for mutex acquisition per chunk of data hurts you, I recommend rethinking what you are doing. It takes 3 microseconds at most, while your thread scheduler's time slice may be as big as 60 milliseconds on Windows.
It may hurt in extreme cases. If you think you are programming extreme communications, please ask again, and I promise to tell you all about it.
To address your question about when it knew to slow down, you seem to lack an understanding of TCP congestion mechanisms. "Slow start" is what you're talking about, but it's not quite how you've worded it. Slow start is exactly that -- starts off slow, and gets faster, up to as fast as the other end is willing to go, wire line speed, whatever.
With respect to the rest of your question, Pavel's answer should suffice.