Hadoop: how to access (many) photo images for map/reduce processing?
I have 10M+ photos saved on the local file system. Now I want to go through each of them to analyze the binary of the photo and see if it's a dog. I basically want to do the analysis on a clustered Hadoop environment. The problem is: how should I design the input for the map method? Let's say, in the map method, new FaceDetection(photoInputStream).isDog() is all the underlying logic for the analysis.
Specifically:
1. Should I upload all of the photos to HDFS? Assuming yes, how can I use them in the map method?
2. Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and then, in the map method, load the binary like photoInputStream = getImageFromHDFS(photopath);? (Actually, what is the right way to load a file from HDFS during the execution of the map method?)
It seems I am missing some knowledge about the basic principles of Hadoop, map/reduce and HDFS, but could you please point me in the right direction on the questions above? Thanks!
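For concreteness, a minimal sketch of the design described in question 2, assuming the new org.apache.hadoop.mapreduce API, a text input file with one HDFS photo path per line, and the question's hypothetical FaceDetection class; everything other than the Hadoop classes is made up for illustration:

    import java.io.InputStream;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input: a text file with one HDFS photo path per line (TextInputFormat).
    // Output: (photo path, "dog" / "not-dog").
    // FaceDetection is the hypothetical detector from the question, not a real library.
    public class DogDetectMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            Path photoPath = new Path(line.toString().trim());
            // The "getImageFromHDFS" the question asks about boils down to FileSystem.open().
            FileSystem fs = photoPath.getFileSystem(context.getConfiguration());
            try (InputStream photoInputStream = fs.open(photoPath)) {
                boolean isDog = new FaceDetection(photoInputStream).isDog();
                context.write(new Text(photoPath.toString()),
                        new Text(isDog ? "dog" : "not-dog"));
            }
        }
    }

As the answers below point out, this works, but the input splits are computed from the small text file rather than from the images, so every map task ends up streaming image bytes over the network.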
Comments (3)
The major problem is that each photo is going to be in its own file. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to consider pre-serializing the files into SequenceFiles (one image per key-value pair). This will make loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data in one SequenceFile, if you so desire; Hadoop handles splitting SequenceFiles quite well.
Basically, the way this works is: you have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as key-value pairs in a SequenceFile. Keep going and keep writing into HDFS. This may take a while, but you'll only have to do it once.
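A rough sketch of such a packing process, assuming a Hadoop 2.x-style SequenceFile API, with the file name as the key and the raw image bytes as the value; the ImagePacker class name and the argument layout (local photo directory, HDFS output path) are made up for illustration:

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs a local directory of images into a single SequenceFile on HDFS:
    // key = file name (Text), value = raw image bytes (BytesWritable).
    public class ImagePacker {
        public static void main(String[] args) throws Exception {
            File localPhotoDir = new File(args[0]);          // e.g. a local photo directory
            Path output = new Path(args[1]);                 // e.g. an HDFS path for the SequenceFile
            Configuration conf = new Configuration();
            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            try {
                for (File image : localPhotoDir.listFiles()) {
                    byte[] raw = Files.readAllBytes(image.toPath());
                    writer.append(new Text(image.getName()), new BytesWritable(raw));
                }
            } finally {
                IOUtils.closeStream(writer);     // flush and close the writer
            }
        }
    }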
This (making the input a text file of photo paths) is not ok if you have any sort of reasonable cluster (which you should if you are considering Hadoop for this) and you actually want to be using the power of Hadoop. Your MapReduce job will fire off and load the files, but the mappers will be running data-local to the text file, not to the images! So, basically, you are going to be shuffling the image files everywhere, since the JobTracker is not placing tasks where the files are. This will incur a significant amount of network overhead. If you have 1TB of images and more than a few nodes, you can expect that a lot of them will be streamed over the network. Depending on your situation and cluster size (less than a handful of nodes), this may not be so bad.
If you do want to do this, you can use the FileSystem API to access the files (you want the open method).
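For comparison, once the images have been packed into a SequenceFile like the one sketched above, the mapper never touches the FileSystem API directly; the framework hands it one (file name, image bytes) pair at a time. A sketch, again using the question's hypothetical FaceDetection class and the key/value layout assumed in the packing sketch:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Used with SequenceFileInputFormat over the file written by the packing sketch above:
    // key = file name, value = raw image bytes.
    public class DogDetectFromSequenceFileMapper
            extends Mapper<Text, BytesWritable, Text, Text> {

        @Override
        protected void map(Text fileName, BytesWritable imageBytes, Context context)
                throws IOException, InterruptedException {
            ByteArrayInputStream photoInputStream =
                    new ByteArrayInputStream(imageBytes.copyBytes());
            boolean isDog = new FaceDetection(photoInputStream).isDog();
            context.write(fileName, new Text(isDog ? "dog" : "not-dog"));
        }
    }

In the driver, job.setInputFormatClass(SequenceFileInputFormat.class) plus the usual FileInputFormat.addInputPath(...) is all the input-side wiring this needs.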
Assuming it takes a second to put each file into the sequence file, converting the individual files into a sequence file will take ~115 days (10M files at 1 s each is 10,000,000 s, i.e. about 115 days). Even with parallel processing on a single machine, I don't see much improvement, because disk read/write will be the bottleneck for reading the photo files and writing the sequence file. Check the Cloudera article on the small files problem; it also references a script which converts a tar file into a sequence file, and how much time the conversion took.
Basically the photos have to be processed in a distributed way to convert them into sequence files. Back to Hadoop :)
According to Hadoop: The Definitive Guide, as a rule of thumb each file, directory, and block takes about 150 bytes of NameNode memory. So, directly loading 10M files (each file plus its block is roughly 20M namespace objects) will require around 3,000 MB of memory just to store the namespace on the NameNode. Forget about streaming the photos across nodes during the execution of the job.
There should be a better way of solving this problem.
Another approach is to load the files as-is into HDFS and use CombineFileInputFormat, which combines the small files into an input split and considers data locality while calculating the input splits. The advantage of this approach is that the files can be loaded into HDFS as-is, without any conversion, and there is also not much data shuffling across nodes.
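A hedged sketch of what that can look like with the new API; the outer class follows the standard CombineFileInputFormat/CombineFileRecordReader pattern, and the nested WholeImageRecordReader (which reads one whole image file into a BytesWritable) is an illustrative implementation, not an existing Hadoop class:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    // Groups many small image files into each input split while taking locality into account.
    public class CombineImageInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // never split an individual image across splits
        }

        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            return new CombineFileRecordReader<Text, BytesWritable>(
                    (CombineFileSplit) split, context, WholeImageRecordReader.class);
        }

        // Reads one whole image per record; the (CombineFileSplit, TaskAttemptContext, Integer)
        // constructor is the signature CombineFileRecordReader expects.
        public static class WholeImageRecordReader extends RecordReader<Text, BytesWritable> {
            private final Path path;
            private final long length;
            private final Configuration conf;
            private final Text key = new Text();
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            public WholeImageRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                          Integer index) {
                this.path = split.getPath(index);
                this.length = split.getLength(index);
                this.conf = context.getConfiguration();
            }

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) { }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                byte[] contents = new byte[(int) length];
                FileSystem fs = path.getFileSystem(conf);
                try (FSDataInputStream in = fs.open(path)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                key.set(path.toString());
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        }
    }

In the driver, job.setInputFormatClass(CombineImageInputFormat.class) selects this format; the maximum combined split size can then be tuned with the usual split-size settings so each mapper gets a reasonable batch of images.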
I was on a project a while back (2008?) where we did something very similar with Hadoop. I believe we initially used HDFS to store the pics, and then we created a text file that listed the files to process. The concept is that you use map/reduce to break the text file into pieces and spread those pieces out across the cloud, letting each node process some of the files based on the portion of the list it receives. Sorry I don't remember more explicit details, but this was the general approach.