Spark RDD saveAsTextFile to S3 takes a very long time

Posted on 2025-01-18 00:37:05

I have a Spark Streaming job on EMR which runs in batches of 30 minutes, processes the data, and finally writes the output to several different files in S3. The output step to S3 is taking too long (about 30 minutes) to write the files. On investigating further, I found that most of the time is spent after all tasks have written their data to the temporary folder (which happens within 20 seconds); the rest is taken up by the master node moving the files in S3 from the _temporary folder to the destination folder and renaming them, etc. (Similar to: Spark: long delay between jobs.)

Some other details on the job configuration, file format, etc. are as follows:

  • EMR version: emr-5.22.0
  • Hadoop version: Amazon 2.8.5
  • Applications: Hive 2.3.4, Spark 2.4.0, Ganglia 3.7.2
  • S3 files: written using the RDD saveAsTextFile API with an S3A URL; the S3 file format is text (see the sketch after this list)
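
For reference, the write path described above presumably boils down to something like the following minimal sketch (the bucket, prefix and the writeBatch helper are hypothetical, not taken from the actual job):

```scala
import org.apache.spark.rdd.RDD

// Minimal sketch of the current write path: an RDD of lines written as text
// through the S3A connector. Each task first writes under .../_temporary, and
// the job commit then moves the files into place one by one, which is the slow
// step described above. Bucket and prefix are placeholders.
def writeBatch(lines: RDD[String]): Unit = {
  lines.saveAsTextFile("s3a://my-bucket/output/batch-2025-01-18-0030")
}
```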

Although the EMRFS output committer is enabled by default in the job, it is not taking effect, since we are using RDDs and the text file format, which is supported only from EMR 6.4.0 onwards. One way I can think of to optimize the time taken by the S3 save is to upgrade the EMR version and convert the RDDs to DataFrames/Datasets, using their write APIs instead of saveAsTextFile (a rough sketch of that is below). Is there any other, simpler solution to optimize the time taken by the job?
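
For what it's worth, the DataFrame/Dataset variant mentioned above could look roughly like the sketch below. It assumes a SparkSession is in scope; the bucket/path and the writeBatchAsDataset helper are hypothetical, and whether an S3-optimized committer actually kicks in for text output depends on the EMR release, as discussed in the answer below.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Sketch: convert the RDD of lines to a Dataset[String] and write it through
// the DataFrame writer, which goes through Spark SQL's commit protocol and so
// can pick up an S3-optimized committer on newer EMR releases.
def writeBatchAsDataset(spark: SparkSession, lines: RDD[String]): Unit = {
  import spark.implicits._
  lines.toDS()
    .write
    .mode("overwrite")
    .text("s3://my-bucket/output/batch-2025-01-18-0030") // placeholder path
}
```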

Comments (1)

梦太阳 2025-01-25 00:37:05

Is there any other, simpler solution to optimize the time taken by the job?

Unless you use an S3-specific committer, your jobs will not only be slow, they will be incorrect in the presence of failures. As this may matter to you, it is good that the slow job commits are providing an early warning of the problem, even before worker failures result in invalid output.

Options:

  1. Upgrade. The committers were added for a reason.
  2. Use a real cluster FS (e.g. HDFS) as the output, then upload afterwards (see the sketch after this list).
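
A rough sketch of option 2, assuming the cluster's HDFS has room for a batch of output and that S3DistCp (shipped with EMR) or hadoop distcp does the upload afterwards; all paths and the helper name are placeholders:

```scala
import org.apache.spark.rdd.RDD

// Sketch of option 2: write to the cluster filesystem first (renames on HDFS
// are cheap metadata operations), then copy the finished output to S3 in a
// separate step, e.g.:
//
//   s3-dist-cp --src hdfs:///tmp/output/batch-2025-01-18-0030 \
//              --dest s3://my-bucket/output/batch-2025-01-18-0030
//
def writeBatchViaHdfs(lines: RDD[String]): Unit = {
  lines.saveAsTextFile("hdfs:///tmp/output/batch-2025-01-18-0030") // placeholder path
}
```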

The S3A zero-rename committers do work with saveAsTextFile, but they aren't supported by AWS, and the ASF developers don't test on EMR since it is Amazon's own fork. You might be able to get whatever S3A connector Amazon ships to work, but you'd be on your own if it didn't.
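
For completeness, on an upstream Apache Hadoop 3.1+ / Spark build (i.e. after an upgrade, and subject to the EMR-fork caveat above), enabling one of the S3A committers would look roughly like the sketch below. The "magic" committer is just one of the available choices, and the exact property set should be checked against the Hadoop documentation for the version in use:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: S3A "magic" committer configuration (Hadoop 3.1+ S3A connector).
// These are Hadoop-side options passed through Spark with the "spark.hadoop."
// prefix; the committer finishes writes via multipart uploads instead of
// renames, removing the slow copy-and-delete step in job commit.
val spark = SparkSession.builder()
  .appName("s3a-committer-sketch")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
          "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .getOrCreate()
```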
