Azure Stream Analytics: when does a Stream Analytics job actually process data if the job query uses a daily TUMBLINGWINDOW?

Posted on 2025-01-11 22:07:11


Context

I have created a streaming job using Azure portal which aggregates data using a day wise TUMBLINGWINDOW. Have attached a code snippet below, modified from the docs, which shows similar logic.

SELECT
    DATEADD(day, -1, System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    TollId,
    COUNT(*) AS Count
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(day, 1), TollId

Question

If the TUMBLINGWINDOW outputs at the end of the window (which, in the case that I start my job at midnight of a given day, would mean shortly after midnight of the next day), is data still being processed during the day, or does processing only happen at the moment the query outputs?

A detailed explanation of how this works would be great. I haven't found any documentation that really explains these concepts (and their edge cases) in detail.

Thoughts

I am trying to gauge whether, if I stop the job and later start it again with the "When Last Stopped" option, it would still produce the same aggregations as if I had left it running the whole time (and if so, how), bearing in mind that I am using a daily TUMBLINGWINDOW.
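(A toy model of what's being asked here, not ASA's actual engine: if the same events can be re-read after a restart, then grouping each event into the absolute daily window its timestamp falls in yields identical aggregates whether processing was continuous or replayed.)

```python
from collections import Counter
from datetime import datetime, timedelta

def daily_counts(events):
    """COUNT(*) per (day-window-end, toll_id), windows aligned to absolute midnights."""
    c = Counter()
    for ts, toll_id in events:
        window_end = datetime(ts.year, ts.month, ts.day) + timedelta(days=1)
        c[(window_end, toll_id)] += 1
    return c

events = [
    (datetime(2025, 1, 9, 8, 0), 1),
    (datetime(2025, 1, 9, 17, 30), 1),
    (datetime(2025, 1, 10, 9, 15), 2),
]

live = daily_counts(events)              # processed as they arrive
replayed = daily_counts(sorted(events))  # re-read after a restart
assert live == replayed                  # same aggregates either way
```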


Comments (1)

℉絮湮 2025-01-18 22:07:11


The output time of the tumbling window is absolute and does not depend on the query start time. A daily tumbling window generates an output at 00:00:00, an hourly one at the top of every hour (00:00:00, 01:00:00, ...), and so on.

So here the job waits for 24 hours, patiently loading the data into memory, until 00:00 so it can perform the computation and output the results. Then it starts waiting again.
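That alignment can be illustrated with a short Python sketch (a simplified model, not ASA internals): a tumbling window's boundaries are derived purely from the event timestamp and the window size, measured from a fixed epoch, so two jobs started at different times assign the same event to the same window.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # fixed reference point; boundaries never shift

def window_bounds(event_time: datetime, size: timedelta):
    """Return (start, end) of the tumbling window containing event_time.
    Boundaries depend only on the timestamp and window size, not on job start."""
    n = (event_time - EPOCH) // size   # whole windows elapsed since the epoch
    start = EPOCH + n * size
    return start, start + size

# An event at 14:30 on 2025-01-10 always lands in the same daily window:
print(window_bounds(datetime(2025, 1, 10, 14, 30), timedelta(days=1)))
# -> (datetime(2025, 1, 10, 0, 0), datetime(2025, 1, 11, 0, 0))
```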

Here, with a daily window, nothing prevents you from stopping the job from 00:01 to 23:59.

(EDIT - THIS IS NOT CORRECT - FIXED BELOW) Just be mindful that when you start it, the start time option needs to cover the missing time (so either "When Last Stopped" - because we checkpoint data - or a custom time 24h earlier).

(CORRECTION) Just be mindful that when you start it, the start time option needs to cover the output window you want covered - ASA will reload all the necessary data even if it's before that time. What you drive with the start time is the output time, not the data input period.

As long as the data is still there (be mindful of the event hub's retention period, 1 day by default), you could pause for an entire week and have the job reprocess the whole period to emit 7 results. For that you just need a start time that covers the period.
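A sketch of that catch-up behaviour (assumed semantics, simplified): given an output start time, the job emits every daily window whose end falls after that start time, reading back whatever retained input events are needed to compute them.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
DAY = timedelta(days=1)

def windows_to_emit(output_start: datetime, now: datetime):
    """Daily tumbling windows whose end time falls in (output_start, now].
    The start time selects which outputs are produced; the input read can
    reach back before it to cover the earliest window."""
    n = (output_start - EPOCH) // DAY + 1  # first boundary strictly after start
    end = EPOCH + n * DAY
    emitted = []
    while end <= now:
        emitted.append((end - DAY, end))
        end += DAY
    return emitted

# Job paused for a week, restarted with a start time covering the gap:
wins = windows_to_emit(datetime(2025, 1, 3, 12, 0), datetime(2025, 1, 10, 0, 5))
print(len(wins))  # -> 7 daily results are emitted on catch-up
```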

Note that it takes time to re-ingest the whole dataset and compute the operation over it. So if you absolutely need your daily aggregate to be out at 00:00:00, restart the job a few minutes prior so it can catch up. Otherwise you will get that output at 00:00:10 (or however long it takes to reload the data into memory).
