Spark bzip2 compression not effective
I am seeking your help with an issue I have been having over the last couple of days with bzip2 compression. We need to compress our output text files into bzip2 format.
The problem is that we only go from 5 GB uncompressed to 3.2 GB compressed with bzip2. Seeing other projects compress their 5 GB files down to only 400 MB makes me wonder if I am doing something wrong.
Here is my code:
iDf
.repartition(iNbPartition)
.write
.option("compression","bzip2")
.mode(SaveMode.Overwrite)
.text(iOutputPath)
I am also importing this codec:
import org.apache.hadoop.io.compress.BZip2Codec
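For what it's worth, the BZip2Codec import is only needed if you compress through the RDD API; the DataFrame writer picks the codec from the "compression" option alone. A minimal sketch of the RDD route, reusing the iDf and iOutputPath names from the question (assuming a single string column, as the .text() writer requires):

import org.apache.hadoop.io.compress.BZip2Codec

iDf.rdd
  .map(_.getString(0))                               // one string column, as .text() expects
  .saveAsTextFile(iOutputPath, classOf[BZip2Codec])  // pass the codec class explicitly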
Besides that, I am not setting any configs in my spark-submit, because I have tried many with no luck.
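If you did want to drive compression from configuration rather than the writer option, the usual Hadoop output properties can be set on the SparkContext's Hadoop configuration (or passed as --conf spark.hadoop.* in spark-submit). This is only a sketch and mainly matters for RDD/Hadoop OutputFormat writes; the "compression" option above already covers the .text() case:

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true")
hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
  "org.apache.hadoop.io.compress.BZip2Codec")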
I would really appreciate your help with this.
Comments (1)
Thanks for your help, everyone. The explanation was in the bzip2 algorithm itself: my data is anonymized in a random way, so it is essentially random, and the algorithm can no longer compress it efficiently.
Thank you again
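A quick way to see the effect described here is to compare how bzip2 handles low-entropy versus high-entropy input. The sketch below (assuming Apache Commons Compress is on the classpath, which it normally is in a Spark/Hadoop environment) compresses 5 MB of repetitive bytes and 5 MB of random bytes; the repetitive input shrinks to a few kilobytes while the random input stays close to its original size, which is what randomly anonymized data looks like to the compressor:

import java.io.ByteArrayOutputStream
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream
import scala.util.Random

// Compress a byte array with bzip2 in memory and return the compressed size.
def bzip2Size(data: Array[Byte]): Int = {
  val buf = new ByteArrayOutputStream()
  val bz  = new BZip2CompressorOutputStream(buf)
  bz.write(data)
  bz.close()
  buf.size()
}

val repetitive = Array.fill[Byte](5 * 1024 * 1024)('a'.toByte)                  // low entropy
val random     = Array.fill[Byte](5 * 1024 * 1024)(Random.nextInt(256).toByte)  // high entropy

println(s"repetitive: ${bzip2Size(repetitive)} bytes compressed")
println(s"random:     ${bzip2Size(random)} bytes compressed")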