Merging multiple files into one in Hadoop
I get multiple small files into my input directory which I want to merge into a single file, without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?
Thanks!
8 Answers
To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op), and add compression via MR flags.
If you want compression, add:
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
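A minimal sketch of such a streaming job (the jar location varies by Hadoop version, and the input/output paths are placeholders):

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -input /user/me/input \
    -output /user/me/merged \
    -mapper cat \
    -reducer cat

Adding the two compression flags above alongside -Dmapred.reduce.tasks=1 would gzip the merged output. Note that the shuffle sorts lines, so the merged file's line order will not match the inputs.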
Okay... I figured out a way using hadoop fs commands. It worked when I tested it. Any pitfalls one can think of?
Thanks!
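The answer doesn't show the actual commands, but one hadoop fs-only pattern matching the description (purely illustrative; paths are placeholders) is:

hadoop fs -cat /user/me/input/* | hadoop fs -put - /user/me/merged.txt

One pitfall to be aware of: the bytes still stream through the client machine, even though nothing touches its local disk.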
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted locally at /mnt/hdfs. I run the following command and it works great. Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
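The command itself was lost from the page; given that getmerge requires a local destination, it was presumably something along these lines (paths hypothetical):

hadoop fs -getmerge /user/me/input /mnt/hdfs/user/me/merged.txt

The fuse mount makes HDFS look like the "local" destination, so the merged result lands straight back in HDFS.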
You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.
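If memory serves, it is driven by its class name from the hadoop launcher; treat the class name and argument order (target first, then sources) as assumptions and check the HDFS 0.21 docs:

hadoop org.apache.hadoop.hdfs.tools.HDFSConcat /user/me/target /user/me/input/part-00000 /user/me/input/part-00001

It builds on HDFS's concat operation, which comes with restrictions (for example, the files must share a block size), so it is not a drop-in for arbitrary inputs.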
If you are working on a Hortonworks cluster and want to merge multiple files present in an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar, which runs a single reducer and gets the merged file into the HDFS output location.
You can download this jar from: Get hadoop streaming jar
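The invocation is the same single-reducer, cat-as-mapper-and-reducer streaming job sketched in the first answer, just with this specific jar; a hedged example (the /usr/hdp path is a guess at the HDP layout, so adjust to wherever the jar lives on your cluster):

hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
    -Dmapred.reduce.tasks=1 \
    -input /data/input \
    -output /data/merged \
    -mapper cat \
    -reducer cat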
If you are writing Spark jobs and want to get a merged file to avoid multiple RDD creations and performance bottlenecks, use this piece of code before transforming your RDD:
sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")
This will merge all the part files into one and save it back to the HDFS location.
Addressing this from the Apache Pig perspective: to merge two files with an identical schema via Pig, the UNION command can be used.
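A minimal Pig Latin sketch of that idea (the paths and the two-column schema are made up for illustration):

file_a = LOAD '/user/me/input/file1' USING PigStorage(',') AS (id:int, value:chararray);
file_b = LOAD '/user/me/input/file2' USING PigStorage(',') AS (id:int, value:chararray);
merged = UNION file_a, file_b;
STORE merged INTO '/user/me/merged' USING PigStorage(',');

Note that STORE still writes one part-* file per task, so UNION merges the relations logically rather than guaranteeing a single physical output file.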
All the solutions are equivalent to doing a cat of the HDFS files to the local machine followed by a copy back into HDFS; it only means that the local machine's I/O is on the critical path of data transfer.
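In hadoop fs terms, the pattern being described is roughly (paths are placeholders):

hadoop fs -cat /user/me/input/* > /tmp/merged_local
hadoop fs -copyFromLocal /tmp/merged_local /user/me/merged

Every byte makes a round trip through the submitting machine, which is the bottleneck this answer is pointing at.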