Does Spark Structured Streaming use Hadoop committers?
My team is using Spark Structured Streaming to sink messages from Kafka to HDFS. We're in the late stages of migrating this component to sink messages to AWS S3 instead, and in connection with that we hit a couple of issues regarding Hadoop committers.
I've come to understand that the default "file" committer (documented here) is unsafe to use with S3, which is why this page in the Spark documentation recommends using the "directory" (i.e. staging) committer, and on later versions of Hadoop it also recommends the "magic" committer.
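For reference, this is roughly how the staging committer gets wired in for batch writes. A minimal sketch, assuming the `spark-hadoop-cloud` module is on the classpath; the app name is a placeholder, and the property values come from the Spark cloud-integration and Hadoop S3A committer docs:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the S3A "directory" (staging) committer for s3a:// output paths.
val spark = SparkSession.builder()
  .appName("s3a-committer-demo") // placeholder
  // Route s3a:// output through the S3A committer factory.
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
          "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  // Pick the committer: "directory" (staging) or, on newer Hadoop, "magic".
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  // Spark-side bindings from the spark-hadoop-cloud module.
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
```

The question, then, is whether any of this applies to the streaming write path at all.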
However, it's not clear whether Spark Structured Streaming even uses committers. There's no "_SUCCESS" file in the destination (unlike with normal Spark jobs), and documentation on the committers used in streaming is non-existent.
Can anyone please shed some light on this?
1 Answer
Afraid not. They don't so much commit work as checkpoint it, and nobody has yet done an S3-friendly checkpointer (as of March 2022). The placeholders for that in the S3A committer are in hadoop-3.3.1 (you can commit/abort the writing of a file), but someone has to do the coding and testing for it. This could be your opportunity to get involved in the code.
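To make the distinction concrete, here is a sketch of the Kafka-to-S3 pattern from the question. Broker, topic, and bucket names are placeholders. The streaming file sink tracks completed files in a `_spark_metadata` directory under the output path rather than writing a `_SUCCESS` marker, and per the above the checkpoint state itself is what needs a fault-tolerant store, so it is kept on HDFS here:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate() // placeholder name

// Source: Kafka topic (placeholder broker and topic).
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Sink: parquet files on S3. Exactly-once bookkeeping lives in the
// checkpoint and in _spark_metadata, not in an output committer.
stream.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/events/")          // placeholder bucket
  .option("checkpointLocation", "hdfs:///chk/events") // keep checkpoints off S3
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()
  .awaitTermination()
```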