Distributed programming in Java on a smaller scale
I'm learning a bit more about Hadoop and its applications, and I understand it is geared toward massive datasets and large files. Let's say I have an application that processes a relatively small number of files (say 100k), which isn't a huge number for something like Hadoop/HDFS. However, it does take a significant amount of time to run on a single machine, so I'd like to distribute the processing.
The problem can be broken down into a map-reduce-style problem (e.g. each file can be processed independently, and then I can aggregate the results). I'm open to using infrastructure such as Amazon EC2, but I'm not sure which technologies to explore for actually aggregating the results of the process. It seems like Hadoop might be a bit of overkill here.
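For concreteness, the single-machine version of what I'm doing is shaped roughly like the sketch below; the file-size "processing" is just a placeholder for the real per-file work, and the class name and directory argument are made up for illustration.

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Placeholder for the current single-machine job: each file is processed
// independently, then the per-file results are folded into one aggregate.
public class LocalBatch {

    // Stand-in for the real per-file computation (here it just measures size).
    static long processOne(Path file) throws Exception {
        return Files.size(file);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // "Map" phase: submit one independent task per file in the input directory.
        List<Future<Long>> results = new ArrayList<>();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path file : dir) {
                results.add(pool.submit(() -> processOne(file)));
            }
        }

        // "Reduce" phase: aggregate the per-file results.
        long total = 0;
        for (Future<Long> r : results) {
            total += r.get();
        }
        pool.shutdown();
        System.out.println("aggregate over " + results.size() + " files: " + total);
    }
}
```

Spreading the per-file phase across machines and still collecting the aggregate somewhere is the part I'm unsure about.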
Can anyone provide guidance on this type of problem?
1 Answer
First off, you may want to reconsider your assumption that you can't combine files. Even images can be combined; you just need to figure out how to do that in a way that allows you to break them out again in your mappers. Combining them with some sort of sentinel value or magic number between them could turn them into one giant file.
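To make that concrete, here is a minimal packing/unpacking sketch. It uses length-prefixed records rather than a literal sentinel, which is the same idea but avoids having to escape the delimiter when it happens to appear inside a file's contents; the class name and record layout are invented for this example.

```java
import java.io.*;
import java.nio.file.*;
import java.util.List;

// Rough sketch: pack many small files into one container file as
// [name length][name bytes][content length][content bytes] records.
public class FilePacker {

    static void pack(List<Path> inputs, Path archive) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(archive)))) {
            for (Path p : inputs) {
                byte[] name = p.getFileName().toString().getBytes("UTF-8");
                byte[] body = Files.readAllBytes(p);
                out.writeInt(name.length);
                out.write(name);
                out.writeInt(body.length);
                out.write(body);
            }
        }
    }

    // Reads the records back out; a mapper would do the same over its input.
    static void unpack(Path archive) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(archive)))) {
            while (true) {
                int nameLen;
                try {
                    nameLen = in.readInt();
                } catch (EOFException eof) {
                    break; // no more records
                }
                byte[] name = new byte[nameLen];
                in.readFully(name);
                byte[] body = new byte[in.readInt()];
                in.readFully(body);
                System.out.println(new String(name, "UTF-8") + ": " + body.length + " bytes");
            }
        }
    }
}
```

On Hadoop specifically, a SequenceFile keyed by file name is the usual off-the-shelf container for the same trick, and it saves you from defining your own record format.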
Another option is HBase, where you could store the images in cells. HBase also has a built-in TableMapper and TableReducer, and it can store the results of your processing alongside the raw data in a semi-structured way.
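A rough sketch of what that could look like with HBase's Java client and its MapReduce integration is below. The table name images, column family d, and qualifier raw are invented for the example, and the job wiring (TableMapReduceUtil, a reducer, output handling) is omitted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class HBaseImageSketch {

    // Store one image in a cell: row key = file name, column d:raw = raw bytes.
    static void storeImage(Connection conn, String fileName) throws IOException {
        byte[] body = Files.readAllBytes(Paths.get(fileName));
        try (Table table = conn.getTable(TableName.valueOf("images"))) {
            Put put = new Put(Bytes.toBytes(fileName));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"), body);
            table.put(put);
        }
    }

    // Mapper that reads each stored image back out of its row and emits
    // (file name, per-image result); the real processing would go here.
    public static class ImageMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context context)
                throws IOException, InterruptedException {
            byte[] raw = columns.getValue(Bytes.toBytes("d"), Bytes.toBytes("raw"));
            long result = (raw == null) ? 0 : raw.length; // placeholder "processing"
            context.write(new Text(Bytes.toString(row.get())), new LongWritable(result));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            storeImage(conn, args[0]);
        }
    }
}
```

The results of the reduce step could then be written back into the same table (or a sibling table), which is what keeps the processed output next to the raw data.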
EDIT: As for the "is Hadoop overkill" question, you need to consider the following:
Hadoop adds at least one machine of overhead (the HDFS NameNode). You typically don't want to store data or run jobs on that machine, since it is a single point of failure (SPOF).
Hadoop is best suited for processing data in batch, with relatively high latency. As @Raihan mentions, there are several other FOSS distributed-compute frameworks that may serve your needs better if you need real-time or low-latency results.
100k files isn't all that few. Even at 100 KB each, that's 10 GB of data.
Other than the above, Hadoop is a relatively low-overhead way of approaching distributed computing problems. It has a huge, helpful community behind it, so you can get help quickly when you need it. And it is focused on running on cheap hardware and a free OS, so there really isn't any significant overhead.
In short, I'd try it before you discard it for something else.