How to investigate a Kryo buffer overflow happening in Spark?
I encountered a Kryo buffer overflow exception, but I really don't understand what data could require more than the current buffer size. I already have spark.kryoserializer.buffer.max set to 256MB, and even a toString applied to the dataset items, which should be much bigger than what Kryo requires, takes less than that (per item).

I know I can increase this parameter, and I will for now, but I don't think it is good practice to simply increase resources when hitting a bound without investigating what is happening (the same as getting an OOM and simply increasing the RAM allocation without checking what actually takes more RAM).

=> So, is there a way to investigate what is put in the buffer during Spark DAG execution?

I couldn't find anything in the Spark UI.

Note that "How Kryo serializer allocates buffer in Spark" is not the same question. It asks how the mechanism works (and actually no one answers it), whereas I am asking how to investigate. In that question, all the answers discuss which parameters to use; I know which parameter to use, and I do manage to avoid the exception by increasing it. However, I already consume too much RAM and need to optimize it, the Kryo buffer included.
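One way I could start investigating, as a minimal sketch: serialize a sample of the records myself with the same Kryo setup the job uses, and look at the per-record sizes on the driver. Here ds is a placeholder for the Dataset involved in the failing stage, and the buffer setting mirrors the one above; both are assumptions for illustration, not the actual job code.

    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoSerializer

    // Build a Kryo serializer with the same settings as the job, so the
    // measurement matches what the executors do at shuffle/persist time.
    val kryoConf = new SparkConf()
      .set("spark.kryoserializer.buffer.max", "256m")
    val ser = new KryoSerializer(kryoConf).newInstance()

    // Serialize a sample of records and report the largest ones.
    // `ds` is a stand-in for the Dataset whose stage overflows.
    val largest = ds.take(1000)
      .map(item => (ser.serialize(item).limit(), item))
      .sortBy(-_._1)
      .take(10)

    largest.foreach { case (bytes, item) =>
      println(s"$bytes bytes: ${item.toString.take(100)}")
    }

If the individual records turn out small, the overflow presumably comes from some single large object built during the stage (a broadcast value, a collection accumulated in an aggregation, or a skewed key's grouped values), so sampling the intermediate data right before the failing stage would be the next step.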
Comments (1)
All data that is sent over the network, written to disk, or persisted in memory is serialized during the execution of the Spark DAG. Hence, the Kryo serialization buffer must be larger than any single object you attempt to serialize, and it must be less than 2048m.
https://spark.apache.org/docs/latest/tuning.html#data-serialization
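As a sketch of those two bounds (larger than your biggest serialized object, smaller than 2048m), the relevant settings can be passed through SparkConf; the values below are placeholders, not a recommendation:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer", "1m")       // initial buffer size
      .set("spark.kryoserializer.buffer.max", "512m") // hard ceiling is 2048m

    val spark = SparkSession.builder().config(conf).getOrCreate()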