Spark: aggregate events from two different Kafka topics by timestamp
Assume a Kafka system with the following two topics:
- created
- deleted
They are used to advertise the creation and deletion of items.
The structure of the events in Kafka is JSON and the same for both topics:
{
  "id": "a1cf621a-2a96-4b70-9dd6-3c54a2819eef",
  "timestamp": "2022-01-05T07:31:04.913000"
}
Now, how is it possible with Spark (Scala) to accumulate the deleted and created amounts such that we get the number of current items by timestamp?
Assume the following events in Kafka
topic: created
{"id":"1","timestamp":"2022-01-01T00:00:00.000000"}
{"id":"2","timestamp":"2022-01-02T00:00:00.000000"}
topic: deleted
{"id":"2","timestamp":"2022-01-03T00:00:00.000000"}
{"id":"1","timestamp":"2022-01-04T00:00:00.000000"}
So this basically means:
- 2022-01-01: 1 item got created, total count of items is 1
- 2022-01-02: 1 item got created, total count of items is 2
- 2022-01-03: 1 item got deleted, total count of items is 1
- 2022-01-04: 1 item got deleted, total count of items is 0
The resulting output of the program should be the count of items per timestamp, for example:
----------------------------------------
| timestamp | count |
----------------------------------------
| 2022-01-01T00:00:00.000000 | 1 |
| 2022-01-02T00:00:00.000000 | 2 |
| 2022-01-03T00:00:00.000000 | 1 |
| 2022-01-04T00:00:00.000000 | 0     |
----------------------------------------
How can two topics be merged and the result ordered by timestamp?
1 Answer
You can read from both topics at once with
.option("subscribe", "created,deleted")
That gives you a single dataframe covering both topics; parse the timestamp out of the JSON value, sort by it, and aggregate/reduce the dataframe to get the output.
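A minimal batch sketch of that approach in Scala, assuming the spark-sql-kafka connector and brokers at localhost:9092 (both placeholders to adjust): it reads both topics in one go, maps created events to +1 and deleted events to -1, and takes a running sum in timestamp order.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ItemCountByTimestamp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("item-count-by-timestamp")
      .getOrCreate()

    // Shared JSON schema of both topics.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("timestamp", StringType)
    ))

    val events = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "created,deleted")              // both topics in one read
      .option("startingOffsets", "earliest")
      .load()
      .select(col("topic"), from_json(col("value").cast("string"), schema).as("event"))
      .select(
        col("event.timestamp").as("timestamp"),
        // a created event adds an item, a deleted event removes one
        when(col("topic") === "created", lit(1)).otherwise(lit(-1)).as("delta")
      )

    // Running total over all events in timestamp order = current item count.
    val byTime = Window.orderBy("timestamp")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    events
      .withColumn("count", sum("delta").over(byTime))
      .select("timestamp", "count")
      .orderBy("timestamp")
      .show(truncate = false)
  }
}

Since the timestamps are ISO-8601 strings, sorting them lexically matches chronological order, and this reproduces the 1, 2, 1, 0 sequence from the example. Note the global (unpartitioned) window pulls all rows into a single partition, which is fine for a sketch but worth rethinking for large data.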
Alternatively, Kafka records already carry a timestamp of their own, which Spark returns as a column. So you could change your producer design to use a single topic, say
events
, remove the timestamp from the value, move the id into the record key, and let create events have a non-null value and delete events a null value. Either way, you still need a reducer function; a sketch of this variant follows.
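Under that single-topic design, the read side would change roughly as follows (a sketch reusing the session and imports from the previous example; events is the assumed topic name, and the JSON parsing disappears because deletes carry a null value):

val singleTopic = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()
  .select(
    col("timestamp"),                    // Kafka's own record timestamp column
    col("key").cast("string").as("id"),  // the item id, moved into the record key
    // a null value is a delete (tombstone), a non-null value is a create
    when(col("value").isNull, lit(-1)).otherwise(lit(1)).as("delta")
  )
// the same running-sum reducer as above then produces the counts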
Or you could skip Spark altogether: with the proposed single-topic design, a table in Kafka Streams / ksqlDB already holds the data you want. Not necessarily with timestamp information, but at least aggregated counts by id or another value.
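For the Kafka Streams route, a rough Scala sketch using the kafka-streams-scala DSL (import paths as of Kafka 2.4+; application id, brokers, and topic names are placeholders). It builds a table over the events topic, where tombstones drop keys automatically, and counts the live rows under a single constant key:

import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object LiveItemCount extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "live-item-count")   // placeholder
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder

  val builder = new StreamsBuilder()

  // Table view of the `events` topic: a null value (tombstone) removes the
  // key, a non-null value upserts it.
  val items = builder.table[String, String]("events")

  // Re-key everything under one constant key and count the live rows;
  // KTable aggregations subtract automatically when a key is deleted.
  val liveCount = items
    .groupBy((_, value) => ("all", value))
    .count()

  liveCount.toStream.to("item-count-output") // placeholder output topic
  new KafkaStreams(builder.build(), props).start()
}

As noted above, this maintains the current count as records arrive, not a timestamped history like the Spark version.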