How to organise data flows in Flink when tables can be partitioned by the same identifier
I'm convinced that Flink is the perfect solution to my event processing problem. I have even managed to produce a working prototype, but I'm not convinced it is even close to optimal.
Here is my scenario:
- I have two Kinesis streams:
  - One stream contains `Event1` and is stored as JSON.
  - The other stream contains `Event2`, `Event3`, and `Event4`, but is stored as Gzip'd Base64 (which is ultimately also JSON). I have to process this using the `raw` format and then extract the event data using a custom UDF `process_events234`, created by implementing `TableFunction[Row]` in a Scala class.
- I want to detect when 4 corresponding events have arrived, but there is no single value I can use to join all 4 data types the events represent. See below:
| Data Type | Has key1 | Has key2 |
|-----------|----------|----------|
| Event1    | Yes      | No       |
| Event2    | Yes      | Yes      |
| Event3    | No       | Yes      |
| Event4    | No       | Yes      |
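Since no single key spans all four event types, the join has to chain through `Event2`. In plain Scala (hypothetical case classes, not Flink code), the matching condition implied by the table above can be sketched as:

```scala
// Hypothetical case classes mirroring the table above (not Flink code):
// Event1 carries only key1; Event2 carries both keys; Event3 and Event4
// carry only key2.
case class Event1(key1: String)
case class Event2(key1: String, key2: String)
case class Event3(key2: String)
case class Event4(key2: String)

object JoinChain {
  // Event1 joins Event2 on key1; Event3 and Event4 then join via Event2's key2.
  def matches(e1: Event1, e2: Event2, e3: Event3, e4: Event4): Boolean =
    e1.key1 == e2.key1 && e2.key2 == e3.key2 && e2.key2 == e4.key2
}
```

`Event2` acts as the bridge: `key1` links it back to `Event1`, while `key2` links it forward to `Event3` and `Event4`.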
My prototype notebook has the following:
Define a table for event_1s
CREATE TABLE event_1 (
key1,
...
)
WITH (
'connector' = 'kinesis',
'stream' = 'stream_of_event_1s',
...
'format' = 'json'
)
Define a table for events 2, 3, and 4
CREATE TABLE events_234 (
Data BYTES
)
WITH (
'connector' = 'kinesis',
'stream' = 'stream_of_event_1_2_3s',
...
'format' = 'raw'
)
Create a view to separate each event 2,3,4
CREATE VIEW event_N -- where N is 2, 3, 4
AS
SELECT
p.*
FROM
events_234 e
JOIN LATERAL TABLE(process_events234(e.Data)) AS p ON TRUE
WHERE
p.eventType = 'eventN' -- where N is 2, 3, 4
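For context, the decompression step inside a `process_events234`-style UDF can be sketched in plain Scala. The object and method names here are illustrative, and the Flink `TableFunction[Row]` wiring (JSON parsing, `collect`-ing rows) is omitted:

```scala
import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source

// Illustrative decode helper: the Kinesis payload is Base64 text wrapping
// a Gzip'd JSON document. A TableFunction[Row] implementation would call
// something like this before parsing the JSON and emitting rows.
object PayloadCodec {
  def decodePayload(bytes: Array[Byte]): String = {
    val compressed = Base64.getDecoder.decode(new String(bytes, "UTF-8"))
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }
}
```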
Join the data together to get my results
/*INSERT INTO my_downstream_sink */
SELECT
e1.*, e2.*, e3.*, e4.*
FROM
event_1 e1
INNER JOIN event_2 e2 ON e1.key1 = e2.key1
INNER JOIN event_3 e3 ON e2.key2 = e3.key2
INNER JOIN event_4 e4 ON e2.key2 = e4.key2
My current prototype works for a few hundred records over a 10-minute period, but I doubt its ability to scale. What concerns me is that I am not able to partition/`keyBy` the data such that related events would end up on the same worker. I'm new to Flink, but this seems particularly important.
What occurs to me is to expand the number of steps and Kinesis streams such that:
- I join `Event1` and `Event2`, then insert the result onto a new stream `Event1+Event2`, partitioned by `key2`.
- Then I join `Event1+Event2` with `Event3` and `Event4`.
However, I'm just guessing and would appreciate some expert advice and opinions. Thanks!
1 Answer
I wouldn't worry; Flink's SQL planner/optimizer should handle this just fine.
You may find it useful to use EXPLAIN and/or look at the resulting job graph in the Flink web dashboard to get a clearer idea of how the query is being executed. I believe you'll find that it's doing exactly what you propose (creating an `Event1+Event2` stream, keying it by `key2`, and then joining with the other streams) without the expense of writing the `Event1+Event2` stream out to Kinesis and reading it in again.