Spreading out data received in bursts
I am trying to spread out data that is received in bursts. This means I have data that is received by some other application in large bursts. For each data entry I need to make some additional requests to a server, and for those requests I need to limit the traffic. Hence I try to spread out the requests over the time I have until the next data burst arrives.
Currently I am using a token bucket to spread out the data. However, because the data I receive is already badly shaped, I am still either filling up the queue of pending requests, or I get spikes whenever a burst comes in. So this algorithm does not seem to do the kind of shaping I need.
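For reference, this is roughly the kind of token bucket I mean (a minimal Python sketch, not my actual code). Because the bucket is full when a burst arrives, the first `capacity` requests pass through immediately, which is exactly the spike I am trying to avoid:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # starts full, so a burst passes right away
        self.last = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```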
What other algorithms are available for limiting the requests? I know I have times of high load and times of low load, so both should be handled well by the application.
I am not sure if I was really able to explain the problem I am currently having. If you need any clarification, just let me know.
EDIT:
I'll try to clarify the problem some more and explain why a simple rate limiter does not work.
The problem lies in the bursty nature of the traffic and the fact that bursts have different sizes at different times. What is mostly constant is the delay between bursts. Thus we get a bunch of data records to process, and we need to spread them out as evenly as possible before the next bunch comes in. However, we are not 100% sure when the next bunch will come in, only approximately, so a simple "divide the time by the number of records" does not work as it should.
Rate limiting alone does not work, because it does not spread the data out sufficiently. If we are close to saturating the rate, everything is fine and we spread out evenly (although this should not happen too frequently). If we are below the threshold, however, the spreading gets much worse.
I'll give an example to make the problem clearer:
Let's say we limit our traffic to 10 requests per second and new data comes in roughly every 10 seconds.
When we get 100 records at the beginning of a time frame, we will query 10 records each second and we have a perfectly even spread. However, if we get only 15 records, we'll have one second where we query 10 records, one second where we query 5 records, and 8 seconds where we query 0 records, so the traffic level is very unequal over time. It would be better to query 1.5 records each second instead. However, setting that rate would also cause problems, since new data might arrive earlier, in which case we do not have the full 10 seconds and 1.5 queries per second would not be enough. If we use a token bucket, the problem actually gets even worse, because a token bucket lets a burst through at the beginning of the time frame.
This example oversimplifies, however, because in reality we cannot tell the exact number of pending requests at any given moment, only an upper limit. So we would have to throttle each time based on that number.
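To illustrate, this is roughly the adaptive pacing I have in mind (a Python sketch; `estimate_time_to_next_burst()` and the pending count are stand-ins, since both are only approximately known):

```python
import time
from collections import deque

def drain(queue: deque, send, estimate_time_to_next_burst, max_rate=10.0):
    """Spread the pending records over the estimated time until the next
    burst, recomputing the pace after every send, capped at max_rate."""
    while queue:
        pending = len(queue)                        # in reality only an upper bound
        remaining = estimate_time_to_next_burst()   # only approximately known
        # Pace so the backlog drains just in time, but respect the hard limit.
        interval = max(remaining / pending, 1.0 / max_rate)
        send(queue.popleft())
        time.sleep(interval)
```

With 15 pending records and roughly 10 seconds until the next burst, this sends one request about every 0.67 seconds, i.e. the 1.5 records per second from the example above.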
2 Answers
This sounds like a problem within the domain of control theory. Specifically, I'm thinking a PID controller might work.
A first crack at the problem might be dividing the number of records by the estimated time until the next batch. This would be like a P controller - proportional only. But then you run the risk of overestimating the time and building up some unsent records. So try adding in an I term - integral - to account for the built-up error.
I'm not sure you even need a derivative term, if the variation in batch size is random. So try using a PI loop - you might build up some backlog between bursts, but it will be handled by the I term.
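A rough sketch of such a PI loop (Python; `kp` and `ki` are placeholder gains you would have to tune, and `get_backlog`/`send_one` stand in for your queue and request code):

```python
import time

def pi_rate_loop(get_backlog, send_one, kp=0.5, ki=0.1,
                 leak=0.9, tick=1.0, max_rate=10.0):
    """Each tick, derive the send rate from the current backlog (P term)
    plus the accumulated backlog (I term). The integral leaks, because the
    backlog error is never negative and would otherwise wind up forever."""
    integral = 0.0
    while True:
        backlog = get_backlog()                      # records still owed; target is 0
        integral = leak * integral + backlog * tick  # leaky integrator
        rate = min(kp * backlog + ki * integral, max_rate)
        for _ in range(min(int(rate * tick), backlog)):
            send_one()
        time.sleep(tick)
```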
If it's unacceptable to have a backlog, then the solution might be more complicated...
If there are no other constraints, what you should do is figure out the maximum rate at which you are comfortable sending the additional requests, and limit your processing speed accordingly. Then monitor what is happening. If that gets through all of your requests quickly, then there is no harm. If its sustained level of processing is not fast enough, then you need more capacity.
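As a sketch (Python; `max_rate` stands for whatever ceiling you settle on):

```python
import time
from collections import deque

def process_at_fixed_rate(queue: deque, handle, max_rate=10.0):
    """Drain the queue at no more than max_rate requests per second.
    If the backlog grows without bound at this rate, you need more capacity."""
    interval = 1.0 / max_rate
    while True:
        if queue:
            handle(queue.popleft())
        time.sleep(interval)     # hard ceiling: at most one request per interval
```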