Hadoop for processing very large binary files
I have a system I wish to distribute where I have a number of very large non-splittable binary files I wish to process in a distributed fashion. These are on the order of a couple of hundred GB each. For a variety of fixed, implementation-specific reasons, these files cannot be processed in parallel but have to be processed sequentially by the same process through to the end.
The application is developed in C++, so I would be considering Hadoop Pipes to stream the data in and out. Each instance will need to process on the order of 100 GB to 200 GB of its own data sequentially (currently stored in one file), and the application is currently (probably) I/O-bound, so it's important that each job runs entirely locally.
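For reference, the Pipes side of this is fairly thin: the C++ task implements a Mapper (and optionally a Reducer) against the HadoopPipes API and hands them to runTask. A minimal sketch, with placeholder class names and placeholder record handling rather than your actual processing:

```cpp
// Minimal Hadoop Pipes skeleton (sketch only; adapt to the real record format).
#include <string>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class SequentialMapper : public HadoopPipes::Mapper {      // hypothetical name
public:
  SequentialMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    // Each call delivers one input record; with whole-file, non-splittable
    // input this is where the sequential processing of that file would live.
    const std::string& value = context.getInputValue();
    // ... run the stateful, sequential computation over `value` ...
    context.emit(context.getInputKey(),
                 HadoopUtils::toString((int) value.size()));  // placeholder output
  }
};

class PassThroughReducer : public HadoopPipes::Reducer {   // hypothetical name
public:
  PassThroughReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue())
      context.emit(context.getInputKey(), context.getInputValue());
  }
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<SequentialMapper, PassThroughReducer>());
}
```

The compiled binary is then launched with the `hadoop pipes` command; what the input records actually are (whole files, file names, ...) is decided by the job's input format on the Java side, which is where the questions below about splittability come in.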
I'm very keen on HDFS for hosting this data - the ability to automatically maintain redundant copies and to rebalance as new nodes are added will be very useful. I'm also keen on MapReduce for its simplicity of computation and its requirement to host the computation as close as possible to the data. However, I'm wondering how suitable Hadoop is for this particular application.
I'm aware that to represent my data I can generate non-splittable files, or alternatively generate huge sequence files (in my case, on the order of 10 TB for a single file - should I pack all my data into one), and that it's therefore possible to process my data using Hadoop. However, it seems like my model doesn't fit Hadoop that well: does the community agree? Or have suggestions for laying this data out optimally? Or even for other cluster computing systems that might fit the model better?
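On laying the data out: one pattern sometimes used for whole-file jobs is to make the MapReduce input just a small list of file names (one per map task) and let each C++ task open and stream its assigned file itself, for example through the libhdfs C API. A rough sketch, assuming the HDFS path arrives as the map value and using an arbitrary 64 MB read buffer:

```cpp
// Sketch: stream one large HDFS file sequentially via libhdfs (hdfs.h).
#include <fcntl.h>    // O_RDONLY
#include <vector>
#include "hdfs.h"

bool processFileSequentially(const char* path) {
  // "default", 0 = connect to the namenode named in the client configuration.
  hdfsFS fs = hdfsConnect("default", 0);
  if (!fs) return false;

  // bufferSize / replication / blocksize of 0 = use the configured defaults.
  hdfsFile in = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
  if (!in) { hdfsDisconnect(fs); return false; }

  std::vector<char> buf(64 * 1024 * 1024);   // 64 MB read buffer (arbitrary)
  tSize n;
  while ((n = hdfsRead(fs, in, &buf[0], (tSize) buf.size())) > 0) {
    // ... feed buf[0..n) into the stateful, sequential computation ...
  }

  hdfsCloseFile(fs, in);
  hdfsDisconnect(fs);
  return n == 0;   // 0 = clean end of file, -1 = read error
}
```

Note this does not by itself give you the fully local I/O you are after; as the answers below point out, locality still depends on where the blocks of that file actually live.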
This question is perhaps a duplicate of existing questions on Hadoop, except that my system requires an order of magnitude or two more data per individual file (previously I've seen the question asked about individual files a few GB in size). So forgive me if this has been answered before, even for this size of data.
Thanks,
Alex
2 Answers
It seems like you are working with a relatively small number of large files. Since your files are huge and not splittable, Hadoop will have trouble scheduling and distributing jobs effectively across the cluster. I think the more files you process in one batch (like hundreds), the more worthwhile it will be to use Hadoop.
Since you're only working with a few files, have you tried a simpler distribution mechanism, like launching processes on multiple machines using ssh or GNU Parallel? I've had a lot of success using this approach for simple tasks. Using an NFS-mounted drive shared across all your nodes also limits the amount of copying you would have to do.
You can write a custom InputSplit for your file, but as bajafresh4life said it won't really be ideal, because unless your HDFS block size is the same as your file size, your files are going to be spread all around and there will be network overhead. Or, if you do make your HDFS block size match your file size, then you're not getting the benefit of all your cluster's disks. The bottom line is that Hadoop may not be the best tool for you.
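For completeness, if you did experiment with matching block size to file size, HDFS lets the block size be set per file at write time rather than cluster-wide; from C++ it is one of the parameters of libhdfs's hdfsOpenFile. A sketch with illustrative numbers:

```cpp
// Sketch: create a file with a large per-file HDFS block size via libhdfs,
// so one huge file is stored in fewer, bigger blocks (numbers illustrative).
#include <fcntl.h>    // O_WRONLY
#include "hdfs.h"

hdfsFile createWithLargeBlocks(hdfsFS fs, const char* path) {
  const tSize blockSize = 1024 * 1024 * 1024;   // 1 GB blocks
  // Parameters: flags, bufferSize, replication, blockSize (0 = configured default).
  return hdfsOpenFile(fs, path, O_WRONLY, 0, 3, blockSize);
}
```

The trade-off is exactly the one described above: fewer, larger blocks keep a file's data together on fewer nodes, at the cost of spreading the work across fewer disks.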