迭代列表状态,有数百万张唱片
我想将所有CDC记录存储在列表状态中,并在收到触发消息后将这些记录流式传输到各自的水槽。
列表状态可以成长为一百万个记录,keyedProcessfunction
中的列表状态是否会导致内存问题?计划使用RockSDB状态后端存储该州。在这种情况下,流列表状态的正确方法是什么?
I want to store all the CDC records in list state and streams those records to respective sinks once trigger message is received.
The list state can grow up-to a million records, will iteration over the list state in KeyedProcessFunction
causes memory issues? Planning to use RocksDB state backend to store the state. What is the correct way of streaming the list state in this case?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
关于
listState
的内存使用情况,此答案说明了如何与rocksdb
状态后端一起使用内存: https://stackoverflow.com/a/666622888/19059974看来整个列表似乎需要适合堆,因此根据您的元素大小,可能需要很多内存。
理想情况下,您希望将状态键入较小的分区,因此在增加任务并行性时可以扩散。另外,解决方法可能是使用
mapState
,它在迭代地图上迭代时似乎并未将所有内容加载到内存中。它将使用比listState
的存储更多的存储空间,并且很可能附加不会那么快,但是应该让您使用更少的内存来迭代它。Regarding memory usage of
ListState
this answer explains how memory is used with theRocksDB
state backend: https://stackoverflow.com/a/66622888/19059974It seems that the whole list will need to fit into heap, so depending on the size of your elements, it could take a lot of memory.
Ideally, you would want to key the state into smaller partitions, so it can be spread when increasing the task parallelism. Alternatively, a workaround could be to use a
MapState
which seems to not load all its contents into memory when iterating over the map. It will use more storage than aListState
and most likely appending won't be as fast, but should allow you to iterate over it using less memory.