How to organize the data flow in Flink when the tables cannot be partitioned using the same identifier

Posted 2025-01-23 13:47:23

I'm convinced that Flink is the perfect solution to my event processing problem. I have even managed to produce a working prototype, but I'm not convinced it is even close to optimal.

Here is my scenario:

  • I have two Kinesis streams
    • One stream contains Event1 and is stored as JSON
    • The other stream contains Event2, Event3, and Event4, stored as gzipped Base64 (which is ultimately also JSON). I have to ingest this using the raw format and then extract the event data with a custom UDF, process_events234, created by implementing TableFunction[Row] in a Scala class (a sketch of such a UDF follows the table below).
  • I want to detect when 4 corresponding events have arrived, but there is no single value I can use to join all 4 data types the events represent. See below:
Data Type | Has key1 | Has key2
----------|----------|---------
Event1    | Yes      | No
Event2    | Yes      | Yes
Event3    | No       | Yes
Event4    | No       | Yes
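
For reference, here is a minimal sketch of what such a TableFunction could look like (an illustration under assumptions, not the actual UDF: the class name, the output fields eventType/key1/key2, and the payload layout are guesses, and the JSON parsing is elided):

import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source

import org.apache.flink.table.annotation.{DataTypeHint, FunctionHint}
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row

// Sketch: decode Base64, gunzip, then emit one row per decoded event.
@FunctionHint(output = new DataTypeHint("ROW<eventType STRING, key1 STRING, key2 STRING>"))
class ProcessEvents234 extends TableFunction[Row] {
  def eval(data: Array[Byte]): Unit = {
    val gunzipped = new GZIPInputStream(
      new ByteArrayInputStream(Base64.getDecoder.decode(data)))
    val json = Source.fromInputStream(gunzipped, "UTF-8").mkString
    // Parse `json` with a JSON library of your choice, then emit each event:
    // collect(Row.of(eventType, key1, key2))
  }
}

The function would then be registered before the views below can reference it, e.g. with tableEnv.createTemporarySystemFunction("process_events234", classOf[ProcessEvents234]).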

My prototype notebook has the following:

Define a table for event_1s

CREATE TABLE event_1 (
  key1 VARCHAR,  -- column type assumed for illustration
  ...
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'stream_of_event_1s',
    ...
    'format' = 'json'
)

Define a table for events 2, 3, and 4

CREATE TABLE events_234 (
  Data BYTES
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'stream_of_event_1_2_3s',
    ...
    'format' = 'raw'
)

Create views to separate out events 2, 3, and 4

CREATE VIEW event_N  -- where N is 2, 3, 4
AS
SELECT 
      p.*
FROM
      events_234 e
      JOIN LATERAL TABLE(process_events234(e.Data)) AS p ON TRUE
WHERE
      p.eventType = 'eventN'  -- where N is 2, 3, 4

Join the data together to get my results

/*INSERT INTO my_downstream_sink */
SELECT
    e1.*, e2.*, e3.*, e4.*
FROM
    event_1 e1
    INNER JOIN event_2 e2 ON e1.key1 = e2.key1
    INNER JOIN event_3 e3 ON e2.key2 = e3.key2
    INNER JOIN event_4 e4 ON e2.key2 = e4.key2

My current prototype works for a few hundred records over a 10-minute period, but I doubt its ability to scale. What concerns me is that I am not able to partition/keyBy the data so that related events end up on the same worker. I'm new to Flink, but this seems particularly important.

What occurs to me is to expand the number of steps and kinesis streams such that:

  • I join Event1 and Event2, then insert the result onto a new stream, Event1+Event2, partitioned by key2
  • Then join Event1+Event2 with Event3 and Event4 (a rough sketch follows this list)
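
In Flink SQL, this staged approach might look roughly like the following (a sketch only: names such as event_1_2 and stream_of_event_1_2s are placeholders I am assuming, and the column lists are elided as above; PARTITIONED BY tells the Kinesis sink to use key2 as the partition key):

CREATE TABLE event_1_2 (
  key1,
  key2,
  ...
)
PARTITIONED BY (key2)
WITH (
    'connector' = 'kinesis',
    'stream' = 'stream_of_event_1_2s',
    ...
    'format' = 'json'
)

INSERT INTO event_1_2
SELECT
    e1.*, e2.key2
FROM
    event_1 e1
    INNER JOIN event_2 e2 ON e1.key1 = e2.key1

INSERT INTO my_downstream_sink
SELECT
    e12.*, e3.*, e4.*
FROM
    event_1_2 e12
    INNER JOIN event_3 e3 ON e12.key2 = e3.key2
    INNER JOIN event_4 e4 ON e12.key2 = e4.key2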

However, I'm just guessing and would appreciate some expert advice and opinions. Thanks!


Comments (1)

摇划花蜜的午后 2025-01-30 13:47:23

I wouldn't worry; Flink's SQL planner/optimizer should handle this just fine.

You may find it useful to use EXPLAIN and/or look at the resulting job graph in the Flink web dashboard to get a clearer idea of how the query is being executed. I believe you'll find that it's doing exactly what you propose (creating an Event1+Event2 stream, keying it by key2, and then joining with the other streams) without the expense of writing the Event1+Event2 stream out to Kinesis and reading it in again.
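
For instance, prefixing the final query with EXPLAIN (or EXPLAIN PLAN FOR on older Flink versions) prints the optimized execution plan:

EXPLAIN
SELECT
    e1.*, e2.*, e3.*, e4.*
FROM
    event_1 e1
    INNER JOIN event_2 e2 ON e1.key1 = e2.key1
    INNER JOIN event_3 e3 ON e2.key2 = e3.key2
    INNER JOIN event_4 e4 ON e2.key2 = e4.key2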
