如何获取滚动窗口的开始和结束时间?

发布于 2025-01-18 06:04:34 字数 136 浏览 4 评论 0原文

我获得了创建和更新时间字段的数据流。 我需要通过tumblingwindow 1小时来获得创建和更新的记录。

如何在tumblingwindow开始和结束时间之间过滤createon?

I get a stream of data that has CreatedOn and UpdatedOn time field.
I need to get a count of Created and Updated records by TumblingWindow of 1 hour.

How can I filter CreatedOn to be between the TumblingWindow start and end time?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

客…行舟 2025-01-25 06:04:34

在下面编辑了以下评论

我将做出以下假设:

  • 如果在上游系统中创建了一行,您将获得具有相同 CreatedOn 和 UpdatedOn 时间戳的记录
  • 如果在上游系统中更新了一行,您 将获得具有相同 CreatedOn 和 UpdatedOn 时间戳的记录使用 UpdatedOn 与 CreatedOn 获取不同的记录

假设您的输入记录如下所示:

{"rowId" : 123, "CreatedOn": "2022-01-01 00:00:00", "UpdatedOn": null }
{"rowId" : 123, "CreatedOn": "2022-01-01 00:00:00", "UpdatedOn": "2022-01-01 01:10:00"}

通过以下查询,您可以获得每小时创建/更新的行数:

SELECT
    DATEADD(hour,-1,System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    SUM(CASE WHEN CreatedOn = UpdatedOn THEN 1 ELSE 0 END) AS CreatedCount,
    SUM(CASE WHEN CreatedOn != UpdatedOn THEN 1 ELSE 0 END) AS UpdatedCount
FROM input1
GROUP BY
    Tumbling(hour,1)

这里给您:

WindowStartWindowEndCreatedCountUpdatedCount
2022-04-01T10:00:002022-04-01T11:00:0011

请注意这里的窗口时间是如何在挂钟时间上完成的(实时服务的到达时间,或者当我执行本地查询时在此处查询开始时间)在 VS Code 中运行),所以我的 2 个操作被重新组合在一个窗口上。实时运行时,如果操作时间和 ASA 接收行(到达时间)之间没有太多延迟,它们肯定会处于不同的时间窗口中。

如果您想对操作时间强制执行时间进度,可以使用 TIMESTAMP BY 子句来实现:

SELECT
    DATEADD(hour,-1,System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    SUM(CASE WHEN CreatedOn = UpdatedOn THEN 1 ELSE 0 END) AS CreatedCount,
    SUM(CASE WHEN CreatedOn != UpdatedOn THEN 1 ELSE 0 END) AS UpdatedCount
FROM input1 TIMESTAMP BY COALESCE(UpdatedOn, CreatedOn)
GROUP BY
    Tumbling(hour,1)

现在输出变为:

WindowStartWindowEndCreatedCountUpdatedCount
2021-12-31T23:00:002022-01 -01T00:00:0010
2022-01-01T01:00:002022-01-01T02:00:0001

Edited following comments below

I'll make the following assumptions:

  • If a row is created in your upstream system, you get a record with equal CreatedOn and UpdatedOn timestamps
  • If a row is updated in your upstream system, you get a record with UpdatedOn different from CreatedOn

Let's say your input records look like:

{"rowId" : 123, "CreatedOn": "2022-01-01 00:00:00", "UpdatedOn": null }
{"rowId" : 123, "CreatedOn": "2022-01-01 00:00:00", "UpdatedOn": "2022-01-01 01:10:00"}

With the following query you can get a count of created/updated rows per hour:

SELECT
    DATEADD(hour,-1,System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    SUM(CASE WHEN CreatedOn = UpdatedOn THEN 1 ELSE 0 END) AS CreatedCount,
    SUM(CASE WHEN CreatedOn != UpdatedOn THEN 1 ELSE 0 END) AS UpdatedCount
FROM input1
GROUP BY
    Tumbling(hour,1)

Here giving you:

WindowStartWindowEndCreatedCountUpdatedCount
2022-04-01T10:00:002022-04-01T11:00:0011

Notice how the windowing is done here on the wall clock time (arrival time on the live service, or here query start time when I do a local run in VS Code), so my 2 operations are regrouped on a single window. When running live, if you don't have much delay between the operation time and when ASA receives the row (arrival time), they will surely be in separate time windows.

If you want to enforce time progression on the operation time, you can do so with the TIMESTAMP BY clause:

SELECT
    DATEADD(hour,-1,System.Timestamp()) AS WindowStart,
    System.Timestamp() AS WindowEnd,
    SUM(CASE WHEN CreatedOn = UpdatedOn THEN 1 ELSE 0 END) AS CreatedCount,
    SUM(CASE WHEN CreatedOn != UpdatedOn THEN 1 ELSE 0 END) AS UpdatedCount
FROM input1 TIMESTAMP BY COALESCE(UpdatedOn, CreatedOn)
GROUP BY
    Tumbling(hour,1)

Now the output becomes:

WindowStartWindowEndCreatedCountUpdatedCount
2021-12-31T23:00:002022-01-01T00:00:0010
2022-01-01T01:00:002022-01-01T02:00:0001
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文