Processing a large set of small files with Hadoop
I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2-3 kB each). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess this is because the cost of setting up and tearing down the job is far greater than the job itself. Such small files also deplete the namespace for file names.
I read that in this case I should use an HDFS archive (HAR), but I am not sure how to modify the WordCount program to read from such archives. Can the program continue to work without modification, or is some modification necessary?
Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.
Comments (5)
Using HDFS won't change the fact that you are making Hadoop handle a large quantity of small files. The best option in this case is probably to cat the files into a single file (or a few large files). This will reduce the number of mappers you have, which will reduce the number of things that need to be processed.
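As a rough illustration of the concatenation step, here is a minimal sketch in plain Java that merges every file of a local directory into one larger file before it is uploaded. The class name SmallFileMerger, the method merge and the file names are hypothetical and not part of Hadoop; the sketch assumes the pages fit in memory one at a time and that the output file lives outside the input directory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a Hadoop class.
class SmallFileMerger {

    /**
     * Appends every regular file in inputDir (sorted by name) to output
     * and returns the number of files merged.
     * Assumption: output is NOT located inside inputDir.
     */
    static int merge(Path inputDir, Path output) {
        try (Stream<Path> listing = Files.list(inputDir);
             OutputStream out = Files.newOutputStream(output)) {
            List<Path> files = listing.filter(Files::isRegularFile)
                                      .sorted()
                                      .collect(Collectors.toList());
            for (Path file : files) {
                byte[] bytes = Files.readAllBytes(file);
                out.write(bytes);
                // Keep the last line of one page apart from the first line of the next.
                if (bytes.length > 0 && bytes[bytes.length - 1] != '\n') {
                    out.write('\n');
                }
            }
            return files.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small demo on temporary files; the file names are made up.
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pages");
        Files.write(dir.resolve("a.html"), "hello world".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.html"), "hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        Path merged = Files.createTempFile("merged", ".txt");
        int count = merge(dir, merged);
        System.out.println(count + " files merged into " + merged);
    }
}
```

For a line-oriented job such as WordCount the file boundaries do not matter, so the merged file can be used as input directly. If each page has to stay identifiable, plain concatenation loses that information.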
Using HDFS can improve performance if you are operating on a distributed system. If you are only running pseudo-distributed (one machine), then HDFS isn't going to improve performance. The limitation is the machine.
When you are operating on a large number of small files, that will require a large number of mappers (by default, at least one map task per input file). The setup/teardown time can be comparable to the processing time of the file itself, causing a large overhead. cat-ing the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.
The benefit you could see from using HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default at the time of writing; newer Hadoop releases default to 128 MB) across machines, and each machine would be capable of processing a block of data that resides on that machine. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if Hadoop is going to unarchive them, will just leave Hadoop with a large number of small files again.
Hope this helps your understanding.
From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URL as the key. If you do an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input. You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
Also see: Providing several non-textual files to a single map in Hadoop MapReduce
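To make the key/value idea concrete without pulling in Hadoop, here is a plain-Java sketch of a container file that stores one record per page, with the URL as key and the HTML as value. This is only a stand-in for the concept, not the SequenceFile format or API: the class PageContainer, its methods and the record layout are assumptions made for illustration. With Hadoop on the classpath, the SequenceFile writer and reader would take this role.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a key/value container; NOT the SequenceFile format.
class PageContainer {

    /** Writes each (url, html) pair as one record: key, value length, value bytes. */
    static void pack(Map<String, String> pages, Path container) {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(container)))) {
            for (Map.Entry<String, String> page : pages.entrySet()) {
                byte[] html = page.getValue().getBytes(StandardCharsets.UTF_8);
                out.writeUTF(page.getKey());
                out.writeInt(html.length);
                out.write(html);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the records back in order; each one would be a single map input. */
    static Map<String, String> unpack(Path container) {
        Map<String, String> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(container)))) {
            while (true) {
                String key;
                try {
                    key = in.readUTF();
                } catch (EOFException endOfContainer) {
                    break; // no more records
                }
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                records.put(key, new String(value, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    // Small demo; the URLs are made up.
    public static void main(String[] args) throws IOException {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "<html>hello world</html>");
        pages.put("http://example.com/b", "<html>hello hadoop</html>");
        Path container = Files.createTempFile("pages", ".bin");
        pack(pages, container);
        unpack(container).forEach((url, html) ->
                System.out.println(url + " -> " + html.length() + " chars"));
    }
}
```

Unlike plain concatenation, this layout keeps page boundaries and the originating URL, which is what allows each page to reach the map function as one input.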
I bookmarked this article recently to read it later and found the same question here :) The entry is a bit old, not exactly sure how relevant it is now. The changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388
Can you concatenate files before submitting them to Hadoop?
CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task).
The overall processing time for the MapReduce job will also fall, since fewer mappers are running.
Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance.
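The effect of packing files into splits can be sketched with a simplified greedy grouping in plain Java. This is an illustration of the idea only, not Hadoop's actual algorithm (which, among other things, also considers where the blocks are stored); the class SplitPacker, the method pack and the size limit are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Hadoop.
class SplitPacker {

    /**
     * Groups file indices into splits so that the total size of a split
     * stays at or below maxSplitBytes. A file larger than the limit gets
     * a split of its own.
     */
    static List<List<Integer>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < fileSizes.length; i++) {
            // Close the current split if this file would push it over the limit.
            if (!current.isEmpty() && currentBytes + fileSizes[i] > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += fileSizes[i];
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    // Demo: ten 3,000-byte pages with a made-up 10,000-byte split limit.
    public static void main(String[] args) {
        long[] sizes = new long[10];
        java.util.Arrays.fill(sizes, 3000L);
        List<List<Integer>> splits = pack(sizes, 10_000L);
        System.out.println(sizes.length + " files -> " + splits.size() + " map tasks");
    }
}
```

In this toy case ten files end up in four splits, so four map tasks would run instead of ten. With thousands of 2-3 kB pages and a split limit in the tens of megabytes, the reduction is far larger.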