How to organize the data flow in Flink when the tables cannot be partitioned using the same identifier

Posted 2025-01-23 13:47:23

I'm convinced that Flink is the perfect solution to my event processing problem. I have even managed to produce a working prototype, but I'm not convinced it is even close to optimal.

Here is my scenario:

  • I have two Kinesis streams
    • One stream contains Event1 and is stored as JSON
    • The other stream contains Event2, Event3, and Event4, stored as gzipped Base64 (which is ultimately also JSON). I have to ingest this using the raw format and then extract the event data with a custom UDF, process_events234, created by implementing TableFunction[Row] in a Scala class (a sketch of such a UDF follows the table below).
  • I want to detect when 4 corresponding events have arrived, but there is no single value I can use to join all 4 data types the events represent. See below:
Data Type | Has key1 | Has key2
----------|----------|---------
Event1    | Yes      | No
Event2    | Yes      | Yes
Event3    | No       | Yes
Event4    | No       | Yes
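
For reference, here is a minimal sketch of what such a TableFunction could look like (an illustration under assumptions, not the actual UDF: the class name, the output fields eventType/key1/key2, and the payload layout are guesses, and the JSON parsing is elided):

import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source

import org.apache.flink.table.annotation.{DataTypeHint, FunctionHint}
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row

// Sketch: decode Base64, gunzip, then emit one row per decoded event.
@FunctionHint(output = new DataTypeHint("ROW<eventType STRING, key1 STRING, key2 STRING>"))
class ProcessEvents234 extends TableFunction[Row] {
  def eval(data: Array[Byte]): Unit = {
    val gunzipped = new GZIPInputStream(
      new ByteArrayInputStream(Base64.getDecoder.decode(data)))
    val json = Source.fromInputStream(gunzipped, "UTF-8").mkString
    // Parse `json` with a JSON library of your choice, then emit each event:
    // collect(Row.of(eventType, key1, key2))
  }
}

The function would then be registered before the views below can reference it, e.g. with tableEnv.createTemporarySystemFunction("process_events234", classOf[ProcessEvents234]).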

My prototype notebook has the following:

Define a table for event_1s

CREATE TABLE event_1 (
  key1 VARCHAR,  -- column type assumed for illustration
  ...
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'stream_of_event_1s',
    ...
    'format' = 'json'
)

Define a table for events 2, 3, and 4

CREATE TABLE events_234 (
  Data BYTES
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'stream_of_event_1_2_3s',
    ...
    'format' = 'raw'
)

Create views to separate out events 2, 3, and 4

CREATE VIEW event_N  -- where N is 2, 3, 4
AS
SELECT 
      p.*
FROM
      events_234 e
      JOIN LATERAL TABLE(process_events234(e.Data)) AS p ON TRUE
WHERE
      p.eventType = 'eventN'  -- where N is 2, 3, 4

Join the data together to get my results

/*INSERT INTO my_downstream_sink */
SELECT
    e1.*, e2.*, e3.*, e4.*
FROM
    event_1 e1
    INNER JOIN event_2 e2 ON e1.key1 = e2.key1
    INNER JOIN event_3 e3 ON e2.key2 = e3.key2
    INNER JOIN event_4 e4 ON e2.key2 = e4.key2

My current prototype works for a few hundred records over a 10-minute period, but I doubt its ability to scale. What concerns me is that I am not able to partition/keyBy the data so that related events end up on the same worker. I'm new to Flink, but this seems particularly important.

What occurs to me is to expand the number of steps and kinesis streams such that:

  • I join Event1 and Event2, then insert the result onto a new stream, Event1+Event2, partitioned by key2
  • Then join Event1+Event2 with Event3 and Event4 (a rough sketch follows this list)
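
In Flink SQL, this staged approach might look roughly like the following (a sketch only: names such as event_1_2 and stream_of_event_1_2s are placeholders I am assuming, and the column lists are elided as above; PARTITIONED BY tells the Kinesis sink to use key2 as the partition key):

CREATE TABLE event_1_2 (
  key1,
  key2,
  ...
)
PARTITIONED BY (key2)
WITH (
    'connector' = 'kinesis',
    'stream' = 'stream_of_event_1_2s',
    ...
    'format' = 'json'
)

INSERT INTO event_1_2
SELECT
    e1.*, e2.key2
FROM
    event_1 e1
    INNER JOIN event_2 e2 ON e1.key1 = e2.key1

INSERT INTO my_downstream_sink
SELECT
    e12.*, e3.*, e4.*
FROM
    event_1_2 e12
    INNER JOIN event_3 e3 ON e12.key2 = e3.key2
    INNER JOIN event_4 e4 ON e12.key2 = e4.key2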

However, I'm just guessing and would appreciate some expert advice and opinions. Thanks!


Comments (1)

摇划花蜜的午后 2025-01-30 13:47:23

I wouldn't worry; Flink's SQL planner/optimizer should handle this just fine.

You may find it useful to use EXPLAIN and/or look at the resulting job graph in the Flink web dashboard to get a clearer idea of how the query is being executed. I believe you'll find that it's doing exactly what you propose (creating an Event1+Event2 stream, keying it by key2, and then joining with the other streams) without the expense of writing the Event1+Event2 stream out to Kinesis and reading it in again.
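
For instance, prefixing the final query with EXPLAIN (or EXPLAIN PLAN FOR on older Flink versions) prints the optimized execution plan:

EXPLAIN
SELECT
    e1.*, e2.*, e3.*, e4.*
FROM
    event_1 e1
    INNER JOIN event_2 e2 ON e1.key1 = e2.key1
    INNER JOIN event_3 e3 ON e2.key2 = e3.key2
    INNER JOIN event_4 e4 ON e2.key2 = e4.key2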
