Image Processing with Amazon MapReduce/Hadoop

I have a project that requires me to process a lot (1000-10000) of big (100 MB to 500 MB) images. The processing I am doing can be done with ImageMagick, but I was hoping to actually do it on Amazon's Elastic MapReduce platform (which I believe runs on Hadoop).

All of the examples I have found deal with text-based inputs (I have come across the Word Count sample a billion times). I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each of the files, and then writing out the new file's output as its own file.

I am pretty sure this can be done on this platform, and that it should be possible to do it with Bash; I don't think I need to go to the trouble of creating a whole Java app or something, but I could be wrong.

I'm not asking for someone to hand me code, but if anyone has sample code or links to tutorials dealing with similar issues, it would be much appreciated...
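For reference, this is roughly the kind of Hadoop Streaming invocation I was picturing - just a sketch, where the streaming jar path, the HDFS paths, and the process_image.sh mapper script are placeholders rather than working values:

# Submit a map-only streaming job: each map task receives one line of the
# image list (NLineInputFormat) and hands it to the bash mapper script.
hadoop jar /path/to/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input /user/hadoop/image-list.txt \
    -output /user/hadoop/image-job-logs \
    -mapper process_image.sh \
    -file process_image.sh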

Answers (4)

我偏爱纯白色 2024-12-17 10:26:36

There are several problems with your task.

As you've seen, Hadoop does not natively process images. But you can export all of the file names and paths into a text file and run a map function over it, so invoking ImageMagick on files that sit on the local disk is not a big deal.
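For example, building that text file could be as simple as something like the following (the directory and file extensions are just placeholders for illustration):

# Hypothetical: collect the local paths of all images into a text file,
# one path per line, to be used as the job input
find /mnt/images -type f \( -name '*.tif' -o -name '*.jpg' \) > image-list.txt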

But how do you deal with data locality?

You can't run ImageMagick directly on files in HDFS (there is only the Java API, and the FUSE mount is not stable), and you can't predict the task scheduling, so a map task may, for example, be scheduled onto a host where the image does not exist.

Sure, you could simply use a single machine and a single task, but then you gain nothing; you just add a bunch of overhead.

There is also a memory problem when you shell out from a Java task. I wrote a blog post about it [1].

and should be able to be done using Bash

That is the next problem: you would at least have to write the map task yourself. You need a ProcessBuilder to call ImageMagick with the specific path and arguments.

I cannot find anything about this kind of work with Hadoop: starting
with a set of files, performing the same action on each of the files,
and then writing out the new file's output as its own file.

Guess why? :D Hadoop is not the right thing for this task.

So basically I would recommend manually splitting your images across multiple hosts in EC2 and running a bash script over them.
It is less hassle and faster. To parallelize on the same host, split your files into one folder per core and run the bash script over each folder. This should utilize your machine quite well, and better than Hadoop ever could.

[1] http://codingwiththomas.blogspot.com/2011/07/dealing-with-outofmemoryerror-in-hadoop.html

小耗子 2024-12-17 10:26:36

I would think you could look at the example in "Hadoop: The Definitive Guide", 3rd Edition. Appendix C outlines a way, in bash, to get a file (from HDFS), unzip it, create a folder, create a new file from the files in the unzipped folder, and then put that file into another HDFS location.

I customized this script myself so that the initial hadoop get is a curl call to the web server hosting the input files I need - I didn't want to put all the files into HDFS. If your files are already in HDFS, you can use the commented-out line instead. Either the HDFS get or the curl will ensure the file is available locally for the task; there's a lot of network overhead in this.

There's no need for a reduce task.

The input file is a list of the URLs of the files to convert/download.

#!/usr/bin/env bash

# NLineInputFormat gives a single line: key is offset, value is Isotropic Url
read offset isofile

# Retrieve file from Isotropic server to local disk
echo "reporter:status:Retrieving $isofile" >&2
target=`echo $isofile | awk '{split($0,a,"/");print a[5] a[6]}'`
filename=$target.tar.bz2
#$HADOOP_INSTALL/bin/hadoop fs -get $isofile ./$filename
curl  $isofile -o $filename

# Un-bzip and un-tar the local file
mkdir -p $target
echo "reporter:status:Un-tarring $filename to $target" >&2
tar jxf $filename -C $target

# Take the file and do what you want with it. 
echo "reporter:status:Converting $target" >&2
convert .... $target/$filename $target.all   # ImageMagick convert; actual options elided

# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz

The New York Times processed 4 TB of raw image data into PDFs in 24 hours using Hadoop. It sounds like they took a similar approach: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. They used the Java API, but the rest is the same: get the file locally, process it, and then put it back into HDFS/S3.

○闲身 2024-12-17 10:26:36

You can take a look at CombineFileInputFormat in Hadoop, which can implicitly combine multiple files into a single split, based on the files.

But I'm not sure how you are going to process the 100 MB-500 MB images, since each one is quite big, in fact bigger than Hadoop's default split size. Maybe you can try a different approach and split one image into several parts.
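If you go that way, ImageMagick itself can cut one big image into fixed-size tiles, for example (the tile size and file names here are arbitrary):

# Hypothetical: cut big_image.tif into 2048x2048 tiles named tile_0.png, tile_1.png, ...
convert big_image.tif -crop 2048x2048 +repage tile_%d.png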

Anyway, good luck.

泛滥成性 2024-12-17 10:26:36

I have been looking for solutions for processing large-scale remote sensing images in Hadoop for a long time, and I have found nothing so far!

Here is an open-source project about splitting large-scale images into smaller ones in Hadoop. I have read the code carefully and tested it, but I found that the performance is not as good as expected. Anyway, it may be helpful and shed some light on the problem.

Project Matsu:
http://www.cloudbook.net/directories/research-clouds/research-project.php?id=100057

Good luck!
