Spreading out data received in bursts
I am trying to spread out data that is received in bursts. This means I have data that is received by some other application in large bursts. For each data entry I need to make some additional requests to a server, and for those requests I need to limit the traffic. Hence I try to spread out the requests over the time I have until the next data burst arrives.
Currently I am using a token bucket to spread out the data. However, because the data I receive is already badly shaped, I am still either filling up the queue of pending requests, or I get spikes whenever a burst comes in. So this algorithm does not seem to do the kind of shaping I need.
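For reference, this is roughly the kind of token bucket I mean (a minimal Python sketch, not my actual code). Because the bucket is full when a burst arrives, the first `capacity` requests pass through immediately, which is exactly the spike I am trying to avoid:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # starts full, so a burst passes right away
        self.last = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```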
What other algorithms are available for limiting the requests? I know I have times of high load and times of low load, so both should be handled well by the application.
I am not sure if I was really able to explain the problem I am currently having. If you need any clarification, just let me know.
EDIT:
I'll try to clarify the problem some more and explain why a simple rate limiter does not work.
The problem lies in the bursty nature of the traffic and the fact that bursts have different sizes at different times. What is mostly constant is the delay between bursts. Thus we get a bunch of data records to process, and we need to spread them out as evenly as possible before the next bunch comes in. However, we are not 100% sure when the next bunch will come in, only approximately, so a simple "divide the time by the number of records" does not work as it should.
Rate limiting alone does not work, because it does not spread the data out sufficiently. If we are close to saturating the rate, everything is fine and we spread out evenly (although this should not happen too frequently). If we are below the threshold, however, the spreading gets much worse.
I'll give an example to make the problem clearer:
Let's say we limit our traffic to 10 requests per second and new data comes in roughly every 10 seconds.
When we get 100 records at the beginning of a time frame, we will query 10 records each second and we have a perfectly even spread. However, if we get only 15 records, we'll have one second where we query 10 records, one second where we query 5 records, and 8 seconds where we query 0 records, so the traffic level is very unequal over time. It would be better to query 1.5 records each second instead. However, setting that rate would also cause problems, since new data might arrive earlier, in which case we do not have the full 10 seconds and 1.5 queries per second would not be enough. If we use a token bucket, the problem actually gets even worse, because a token bucket lets a burst through at the beginning of the time frame.
This example oversimplifies, however, because in reality we cannot tell the exact number of pending requests at any given moment, only an upper limit. So we would have to throttle each time based on that number.
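To illustrate, this is roughly the adaptive pacing I have in mind (a Python sketch; `estimate_time_to_next_burst()` and the pending count are stand-ins, since both are only approximately known):

```python
import time
from collections import deque

def drain(queue: deque, send, estimate_time_to_next_burst, max_rate=10.0):
    """Spread the pending records over the estimated time until the next
    burst, recomputing the pace after every send, capped at max_rate."""
    while queue:
        pending = len(queue)                        # in reality only an upper bound
        remaining = estimate_time_to_next_burst()   # only approximately known
        # Pace so the backlog drains just in time, but respect the hard limit.
        interval = max(remaining / pending, 1.0 / max_rate)
        send(queue.popleft())
        time.sleep(interval)
```

With 15 pending records and roughly 10 seconds until the next burst, this sends one request about every 0.67 seconds, i.e. the 1.5 records per second from the example above.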
2 Answers
This sounds like a problem within the domain of control theory. Specifically, I'm thinking a PID controller might work.
A first crack at the problem might be dividing the number of records by the estimated time until the next batch. This would be like a P controller - proportional only. But then you run the risk of overestimating the time and building up some unsent records. So try adding in an I term - integral - to account for the built-up error.
I'm not sure you even need a derivative term, if the variation in batch size is random. So try using a PI loop - you might build up some backlog between bursts, but it will be handled by the I term.
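A rough sketch of such a PI loop (Python; `kp` and `ki` are placeholder gains you would have to tune, and `get_backlog`/`send_one` stand in for your queue and request code):

```python
import time

def pi_rate_loop(get_backlog, send_one, kp=0.5, ki=0.1,
                 leak=0.9, tick=1.0, max_rate=10.0):
    """Each tick, derive the send rate from the current backlog (P term)
    plus the accumulated backlog (I term). The integral leaks, because the
    backlog error is never negative and would otherwise wind up forever."""
    integral = 0.0
    while True:
        backlog = get_backlog()                      # records still owed; target is 0
        integral = leak * integral + backlog * tick  # leaky integrator
        rate = min(kp * backlog + ki * integral, max_rate)
        for _ in range(min(int(rate * tick), backlog)):
            send_one()
        time.sleep(tick)
```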
If it's unacceptable to have a backlog, then the solution might be more complicated...
If there are no other constraints, what you should do is figure out the maximum rate at which you are comfortable sending the additional requests, and limit your processing speed accordingly. Then monitor what is happening. If that gets through all of your requests quickly, then there is no harm. If its sustained level of processing is not fast enough, then you need more capacity.
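As a sketch (Python; `max_rate` stands for whatever ceiling you settle on):

```python
import time
from collections import deque

def process_at_fixed_rate(queue: deque, handle, max_rate=10.0):
    """Drain the queue at no more than max_rate requests per second.
    If the backlog grows without bound at this rate, you need more capacity."""
    interval = 1.0 / max_rate
    while True:
        if queue:
            handle(queue.popleft())
        time.sleep(interval)     # hard ceiling: at most one request per interval
```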