Merging multiple files into one in Hadoop

Published 2024-09-15 23:25:15

I get multiple small files into my input directory which I want to merge into a single file without using the local file system or writing mapreds. Is there a way I could do it using hadoop fs commands or Pig?

Thanks!

Comments (8)

甜心 2024-09-22 23:25:15

In order to keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and reducer (basically a no-op); add compression using MR flags.

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compression, add:
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
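
For reference, a minimal sketch of the full invocation with the compression flags folded in; $HADOOP_PREFIX, $QUEUE, $INPUT and $OUTPUT are the same placeholders as above, and the generic -D options go before the other streaming options:

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

With a single reducer, the merged (and gzipped) result should land under $OUTPUT as one part-00000.gz file.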

风柔一江水 2024-09-22 23:25:15
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
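
As a quick usage sketch with hypothetical paths: getmerge concatenates every file under the HDFS directory into a single file on the local filesystem, so note that, unlike the streaming approach, it does pull the data down to local disk:

hadoop fs -getmerge /user/alice/input ./merged.txt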
春夜浅 2024-09-22 23:25:15

Okay... I figured out a way using hadoop fs commands:

hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]

It worked when I tested it... any pitfalls one can think of?

Thanks!
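
As a concrete sketch with hypothetical paths, this streams every file in the directory through the client and pipes the concatenation straight back into HDFS via stdin, so nothing is written to local disk (though the bytes still pass through the client machine):

hadoop fs -cat /user/alice/input/* | hadoop fs -put - /user/alice/merged.txt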

菩提树下叶撕阳。 2024-09-22 23:25:15

If you set up FUSE to mount your HDFS to a local directory, then your output can be written to the mounted filesystem.

For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:

hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt

Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
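
For context, a hedged sketch of how such a mount is commonly created with the hadoop-fuse-dfs helper that ships with some distributions; the binary name, NameNode host, and port below are assumptions and vary by distribution and Hadoop version:

# Mount HDFS at /mnt/hdfs via FUSE (hostname and port are placeholders)
sudo mkdir -p /mnt/hdfs
sudo hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs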

可爱暴击 2024-09-22 23:25:15

You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.

濫情▎り 2024-09-22 23:25:15

If you are working in a Hortonworks cluster and want to merge multiple files present in an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar, which runs a single reducer and writes the merged file to the HDFS output location.

$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat

You can download this jar from
Get hadoop streaming jar

If you are writing Spark jobs and want a merged file, to avoid creating multiple RDDs and performance bottlenecks, use this piece of code before transforming your RDD:

sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

This will merge all part files into one and save it back to an HDFS location.
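
One caveat worth adding: saveAsTextFile writes a directory, so after coalesce(1) the merged data sits in a single part file inside it. A minimal sketch with hypothetical paths for promoting that part file to a plain file name:

# coalesce(1) + saveAsTextFile to /data/merged yields /data/merged/part-00000
hadoop fs -mv /data/merged/part-00000 /data/merged.txt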

在巴黎塔顶看东京樱花 2024-09-22 23:25:15

Addressing this from an Apache Pig perspective:

To merge two files with an identical schema via Pig, the UNION command can be used:

 A = load 'tmp/file1' Using PigStorage('\t') as ....(schema1);
 B = load 'tmp/file2' Using PigStorage('\t') as ....(schema1);
 C = UNION A, B;
 store C into 'tmp/fileoutput' Using PigStorage('\t');
〆凄凉。 2024-09-22 23:25:15

All the solutions are equivalent to doing a

hadoop fs -cat [dir]/* > tmp_local_file  
hadoop fs -copyFromLocal tmp_local_file [destination file]

It just means that the local machine's I/O is on the critical path of the data transfer.
