Is there a better way to join updates from two keyed streams than the time window I'm currently using?

Posted 2025-01-24 06:11:01


So what I have is a Flink job that receives a message from Kafka and creates two updateObject events that are then sent through the job. For simplicity, one is an increase event and the other is a decrease event (so a message comes in, and business logic dictates which updateObjects to create so that state is updated properly).

To give a sample updateObject:

{
   "batchId":"b1",
   "type":"increase",
   "amount":1,
   "stateToUpdate":"stateAlpha",
   "stateToUpdateValue":null
}

The updateObjects are then sent downstream in the job. The stream is keyed by a value such that updateObject1 updates one state and updateObject2 updates a different one for its key. After each updates its state, the updated value is put into that updateObject's stateToUpdateValue, and each is sent along through the job.
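
To make that step concrete, here is a minimal sketch of how I picture it; the UpdateObject POJO just mirrors the sample JSON above, and the names ApplyUpdate and "current" are made up for illustration:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// In its own file: a hypothetical POJO mirroring the sample JSON above.
public class UpdateObject {
   public String batchId;
   public String type;                // "increase" or "decrease"
   public long amount;
   public String stateToUpdate;
   public Long stateToUpdateValue;
}

// Applied after updates.keyBy(u -> u.stateToUpdate): each key holds one counter.
public class ApplyUpdate extends KeyedProcessFunction<String, UpdateObject, UpdateObject> {

   private transient ValueState<Long> current;

   @Override
   public void open(Configuration parameters) {
      current = getRuntimeContext().getState(
            new ValueStateDescriptor<>("current", Long.class));
   }

   @Override
   public void processElement(UpdateObject update, Context ctx, Collector<UpdateObject> out)
         throws Exception {
      long value = current.value() == null ? 0L : current.value();
      value += "increase".equals(update.type) ? update.amount : -update.amount;
      current.update(value);

      // Write the post-update value back onto the event and send it along.
      update.stateToUpdateValue = value;
      out.collect(update);
   }
}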

Now the tricky piece is the last step. The final output of the job is an array of the updateObjects from each message. The best idea we could come up with was a tumbling time window of 1 second that collects updateObjects; when it triggers after 1 second, it checks everything in its window, pairs up the ones with the same batchId, puts them into the final output object, and emits it. Obviously this is not ideal: for one, there's no guarantee that all of a batch's updateObjects reach that window within the 1-second span, and it also adds a processing delay since things just sit there.
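
For reference, the current approach amounts to roughly the following sketch, assuming updates is the DataStream<UpdateObject> produced upstream and the stream is keyed by batchId so the window does the pairing:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// The approach described above: a 1-second tumbling window per batchId that
// emits whatever happened to arrive within that window.
DataStream<List<UpdateObject>> output = updates
      .keyBy(u -> u.batchId)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
      .process(new ProcessWindowFunction<UpdateObject, List<UpdateObject>, String, TimeWindow>() {
         @Override
         public void process(String batchId, Context ctx,
                             Iterable<UpdateObject> elements,
                             Collector<List<UpdateObject>> out) {
            List<UpdateObject> batch = new ArrayList<>();
            elements.forEach(batch::add);
            out.collect(batch);  // may be incomplete if the batch straddled a window boundary
         }
      });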

There is no guarantee that exactly two updateObjects are created for every message, since it's very much a case-by-case type of thing. And because the updateObjects get split into different keyed streams (their keys are always different), I can't have a single object flow through the first keyed state, update it, and then flow through the next one to be updated in turn. Once the keying happens they aren't attached, so to speak.

So I wanted to know if anyone could think of a better way to do this, as I feel like there definitely has to be one.


Comments (1)

饭团 2025-01-31 06:11:01


You are right to suspect that there's a better way to do this.

Doing this with tumbling windows has two problems:

  1. You are going to wait half a second, on average, and one second at worst, for something that typically won't take that long.
  2. Despite the fact that you're often spending all this time waiting around, this window-based approach can still rather easily get things wrong. Even if a batch has only two events, and even if they are processed within a few milliseconds of each other, they can still fall on opposite sides of a window boundary. This is because a one-second-long window is going to run, for example, from 12:00:00.000 to 12:00:00.999, and your events might show up at 12:00:00.999 and 12:00:01.001.

Here's an alternative:

At the end of your pipeline, re-key the stream by the batchId, and then use a KeyedProcessFunction to glue the batches back together.

For example, as each record belonging to a specific batch arrives, append it to a keyed ListState object. If you know exactly how many records belong to each batch, then you can keep a counter in ValueState, and when the batch is complete, you can iterate over the list and produce the final result (and don't forget to clear the state when you're finished with it). Or you can use a keyed Timer to wait for some specific duration (relative to the arrival of the first record from each batch), and produce the final result when the Timer fires.
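
Here's a minimal sketch of the timer variant, assuming the UpdateObject POJO from the question; the name GlueBatches and the 1-second timeout are placeholders, not a definitive implementation:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Applied after updates.keyBy(u -> u.batchId): glues each batch back together.
public class GlueBatches extends KeyedProcessFunction<String, UpdateObject, List<UpdateObject>> {

   // Assumption: one second is an acceptable upper bound on how long a batch takes.
   private static final long TIMEOUT_MS = 1000;

   private transient ListState<UpdateObject> batch;
   private transient ValueState<Long> timerTs;

   @Override
   public void open(Configuration parameters) {
      batch = getRuntimeContext().getListState(
            new ListStateDescriptor<>("batch", UpdateObject.class));
      timerTs = getRuntimeContext().getState(
            new ValueStateDescriptor<>("timerTs", Long.class));
   }

   @Override
   public void processElement(UpdateObject update, Context ctx,
                              Collector<List<UpdateObject>> out) throws Exception {
      batch.add(update);

      // One timeout per batch, relative to the arrival of its first record.
      if (timerTs.value() == null) {
         long ts = ctx.timerService().currentProcessingTime() + TIMEOUT_MS;
         ctx.timerService().registerProcessingTimeTimer(ts);
         timerTs.update(ts);
      }
      // If the expected batch size is known, count records here instead and emit
      // as soon as the batch is complete (then clear state and delete the timer).
   }

   @Override
   public void onTimer(long timestamp, OnTimerContext ctx,
                       Collector<List<UpdateObject>> out) throws Exception {
      List<UpdateObject> result = new ArrayList<>();
      for (UpdateObject u : batch.get()) {
         result.add(u);
      }
      out.collect(result);
      // Don't forget to clear the state once the batch is done.
      batch.clear();
      timerTs.clear();
   }
}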

There are examples of working with state and with process functions in the tutorials in Flink's documentation, and in the accompanying training exercises.

Alternatively, you could do this with Flink SQL, using OVER windows.
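
As a rough sketch of the Flink SQL route: the table name, its datagen-backed schema, and the COUNT(*) aggregation are all assumptions for illustration. COUNT(*) here just demonstrates the OVER syntax; a downstream step could assemble the final array once the count reaches the expected batch size.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OverWindowSketch {
   public static void main(String[] args) {
      TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

      // A stand-in source; in practice this would be the Kafka-backed table.
      tEnv.executeSql(
            "CREATE TABLE updates ("
                  + "  batchId STRING,"
                  + "  stateToUpdate STRING,"
                  + "  stateToUpdateValue BIGINT,"
                  + "  proctime AS PROCTIME()"
                  + ") WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

      // For each update, count how many records of the same batch arrived in
      // the last second.
      tEnv.executeSql(
            "SELECT batchId, stateToUpdate, stateToUpdateValue,"
                  + "  COUNT(*) OVER ("
                  + "    PARTITION BY batchId"
                  + "    ORDER BY proctime"
                  + "    RANGE BETWEEN INTERVAL '1' SECOND PRECEDING AND CURRENT ROW"
                  + "  ) AS updatesSeen "
                  + "FROM updates")
            .print();
   }
}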
