Image Processing with Amazon MapReduce/Hadoop

I have a project that requires me to process a lot (1000-10000) of big (100 MB to 500 MB) images. The processing I am doing can be done with ImageMagick, but I was hoping to actually do it on Amazon's Elastic MapReduce platform (which I believe runs on Hadoop).

All of the examples I have found deal with text-based inputs (I have come across the Word Count sample a billion times). I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each of the files, and then writing out the new file's output as its own file.

I am pretty sure this can be done on this platform, and that it should be possible to do it with Bash; I don't think I need to go to the trouble of creating a whole Java app or something, but I could be wrong.

I'm not asking for someone to hand me code, but if anyone has sample code or links to tutorials dealing with similar issues, it would be much appreciated...
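For reference, this is roughly the kind of Hadoop Streaming invocation I was picturing - just a sketch, where the streaming jar path, the HDFS paths, and the process_image.sh mapper script are placeholders rather than working values:

# Submit a map-only streaming job: each map task receives one line of the
# image list (NLineInputFormat) and hands it to the bash mapper script.
hadoop jar /path/to/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input /user/hadoop/image-list.txt \
    -output /user/hadoop/image-job-logs \
    -mapper process_image.sh \
    -file process_image.sh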

Answers (4)

我偏爱纯白色 2024-12-17 10:26:36

There are several problems with your task.

As you've seen, Hadoop does not natively process images. But you can export all of the file names and paths into a text file and run a map function over it, so invoking ImageMagick on files that sit on the local disk is not a big deal.
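For example, building that text file could be as simple as something like the following (the directory and file extensions are just placeholders for illustration):

# Hypothetical: collect the local paths of all images into a text file,
# one path per line, to be used as the job input
find /mnt/images -type f \( -name '*.tif' -o -name '*.jpg' \) > image-list.txt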

But how do you deal with data locality?

You can't run ImageMagick directly on files in HDFS (there is only the Java API, and the FUSE mount is not stable), and you can't predict the task scheduling, so a map task may, for example, be scheduled onto a host where the image does not exist.

Sure, you could simply use a single machine and a single task, but then you gain nothing; you just add a bunch of overhead.

There is also a memory problem when you shell out from a Java task. I wrote a blog post about it [1].

and should be able to be done using Bash

That is the next problem: you would at least have to write the map task yourself. You need a ProcessBuilder to call ImageMagick with the specific path and arguments.

I cannot find anything about this kind of work with Hadoop: starting
with a set of files, performing the same action on each of the files,
and then writing out the new file's output as its own file.

Guess why? :D Hadoop is not the right thing for this task.

So basically I would recommend manually splitting your images across multiple hosts in EC2 and running a bash script over them.
It is less hassle and faster. To parallelize on the same host, split your files into one folder per core and run the bash script over each folder. This should utilize your machine quite well, and better than Hadoop ever could.

[1] http://codingwiththomas.blogspot.com/2011/07/dealing-with-outofmemoryerror-in-hadoop.html

小耗子 2024-12-17 10:26:36

I would think you could look at the example in "Hadoop: The Definitive Guide", 3rd Edition. Appendix C outlines a way, in bash, to get a file (from HDFS), unzip it, create a folder, create a new file from the files in the unzipped folder, and then put that file into another HDFS location.

I customized this script myself so that the initial hadoop get is a curl call to the web server hosting the input files I need - I didn't want to put all the files into HDFS. If your files are already in HDFS, you can use the commented-out line instead. Either the HDFS get or the curl will ensure the file is available locally for the task; there's a lot of network overhead in this.

There's no need for a reduce task.

The input file is a list of the URLs of the files to convert/download.

#!/usr/bin/env bash

# NLineInputFormat gives a single line: key is offset, value is Isotropic Url
read offset isofile

# Retrieve file from Isotropic server to local disk
echo "reporter:status:Retrieving $isofile" >&2
target=`echo $isofile | awk '{split($0,a,"/");print a[5] a[6]}'`
filename=$target.tar.bz2
#$HADOOP_INSTALL/bin/hadoop fs -get $isofile ./$filename
curl  $isofile -o $filename

# Un-bzip and un-tar the local file
mkdir -p $target
echo "reporter:status:Un-tarring $filename to $target" >&2
tar jxf $filename -C $target

# Take the file and do what you want with it. 
echo "reporter:status:Converting $target" >&2
convert .... $target/$filename $target.all   # ImageMagick convert; actual options elided

# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz

The New York Times processed 4 TB of raw image data into PDFs in 24 hours using Hadoop. It sounds like they took a similar approach: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. They used the Java API, but the rest is the same: get the file locally, process it, and then put it back into HDFS/S3.

○闲身 2024-12-17 10:26:36

You can take a look at CombineFileInputFormat in Hadoop, which can implicitly combine multiple files into a single split, based on the files.

But I'm not sure how you are going to process the 100 MB-500 MB images, since each one is quite big, in fact bigger than Hadoop's default split size. Maybe you can try a different approach and split one image into several parts.
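If you go that way, ImageMagick itself can cut one big image into fixed-size tiles, for example (the tile size and file names here are arbitrary):

# Hypothetical: cut big_image.tif into 2048x2048 tiles named tile_0.png, tile_1.png, ...
convert big_image.tif -crop 2048x2048 +repage tile_%d.png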

Anyway, good luck.

泛滥成性 2024-12-17 10:26:36

I have been looking for solutions for processing large-scale remote sensing images in Hadoop for a long time, and I have found nothing so far!

Here is an open-source project about splitting large-scale images into smaller ones in Hadoop. I have read the code carefully and tested it, but I found that the performance is not as good as expected. Anyway, it may be helpful and shed some light on the problem.

Project Matsu:
http://www.cloudbook.net/directories/research-clouds/research-project.php?id=100057

Good luck!
