Hadoop Pipes: how to pass large data records to map/reduce tasks
I'm trying to use map/reduce to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, such that I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each record is a large coherent (i.e. non-splittable) blob, between one and several hundred MB in size. The records will be consumed and processed by a C++ executable. If it weren't for the size of the records, the Hadoop Pipes API would be fine: but this seems to be based around passing the input to map/reduce tasks as a contiguous block of bytes, which is impractical in this case.
I'm not sure of the best way to do this. Does any kind of buffered interface exist that would allow each M/R task to pull multiple blocks of data in manageable chunks? Otherwise I'm thinking of passing file offsets via the API and streaming in the raw data from HDFS on the C++ side.
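To make the offset idea concrete, here's a rough sketch of what I have in mind, using the Pipes C++ API together with libhdfs. It assumes each map input value is a string of the form `path<TAB>offset<TAB>length` coming from a custom record reader (not shown), and names like `OffsetMapper` and `processChunk` are just placeholders, not anything Hadoop provides:

```cpp
#include <algorithm>
#include <cstdint>
#include <fcntl.h>
#include <string>
#include <vector>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
#include "hdfs.h"   // libhdfs

// Mapper whose input value is assumed to be "hdfs_path\toffset\tlength"
// (a made-up record format); it streams the actual record bytes itself.
class OffsetMapper : public HadoopPipes::Mapper {
public:
  explicit OffsetMapper(HadoopPipes::TaskContext& /*context*/) {}

  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> parts =
        HadoopUtils::splitString(context.getInputValue(), "\t");
    const std::string& path = parts[0];
    tOffset offset    = std::stoll(parts[1]);
    tOffset remaining = std::stoll(parts[2]);

    hdfsFS fs = hdfsConnect("default", 0);            // use fs.defaultFS
    hdfsFile in = hdfsOpenFile(fs, path.c_str(), O_RDONLY, 0, 0, 0);

    std::vector<char> buf(4 * 1024 * 1024);           // pull 4 MB at a time
    while (remaining > 0) {
      tSize want = static_cast<tSize>(
          std::min<tOffset>(remaining, static_cast<tOffset>(buf.size())));
      tSize got = hdfsPread(fs, in, offset, buf.data(), want);
      if (got <= 0) break;
      processChunk(buf.data(), got);                  // application-specific work
      offset    += got;
      remaining -= got;
    }
    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);

    context.emit(path, "done");                       // emit whatever the job needs
  }

private:
  void processChunk(const char* /*data*/, tSize /*len*/) {
    // Feed the bytes into the existing C++ processing code here.
  }
};

// Pipes wants a reducer class; a pass-through is enough for this sketch.
class IdentityReducer : public HadoopPipes::Reducer {
public:
  explicit IdentityReducer(HadoopPipes::TaskContext& /*context*/) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue()) {
      context.emit(context.getInputKey(), context.getInputValue());
    }
  }
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<OffsetMapper, IdentityReducer>());
}
```

This is only a sketch of the plumbing, not a drop-in solution; the custom input format that produces the path/offset/length values is the part I'm unsure about.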
I'd welcome opinions from anyone who's tried anything similar - I'm pretty new to Hadoop.
Hadoop is not designed for records around 100 MB in size. You will get OutOfMemoryError and uneven splits because some records are 1 MB and some are 100 MB. By Amdahl's Law your parallelism will suffer greatly, reducing throughput.
I see two options. You can use Hadoop Streaming to feed your large files into your C++ executable as-is. Since this will send your data via stdin it will naturally be streaming and buffered. Your first map task must break up the data into smaller records for further processing; further tasks then operate on the smaller records.
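To make the stdin side concrete, the first map task's executable could look roughly like the sketch below. It assumes the job is configured so that the raw bytes of a record actually arrive on stdin (plain Hadoop Streaming is line-oriented, so binary input normally needs extra setup such as a typed-bytes record format); the 8 MB chunk size and the offset/size output lines are purely illustrative.

```cpp
// Illustrative stdin side of a Streaming map task: read whatever arrives on
// stdin in fixed-size chunks and hand each slice on as a smaller record.
// The "offset<TAB>size" output convention is made up, not prescribed by Hadoop.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const std::size_t kChunk = 8 * 1024 * 1024;   // 8 MB slices of the incoming blob
  std::vector<char> buf(kChunk);
  std::uint64_t offset = 0;

  std::size_t n;
  while ((n = std::fread(buf.data(), 1, buf.size(), stdin)) > 0) {
    // Hand each slice to the existing processing code, or write it out as a
    // smaller record for downstream tasks. Here we just emit bookkeeping lines
    // in the usual key<TAB>value\n Streaming format.
    std::printf("%llu\t%zu\n", static_cast<unsigned long long>(offset), n);
    offset += n;
  }
  return 0;
}
```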
If you really can't break it up, make your MapReduce job operate on file names. The first mapper gets some file names, runs them through your mapper C++ executable, and stores the results in more files. The reducer is given all the names of the output files, and the process repeats with a reducer C++ executable. This will not run out of memory but it will be slow. Besides the parallelism issue, you won't get reduce jobs scheduled onto nodes that already have the data, resulting in non-local HDFS reads.
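For the file-name variant, a Streaming-style mapper could look roughly like the following. Each input line is taken to be an HDFS path; libhdfs streams the file in pieces, a processed copy is written back to HDFS, and the output path is emitted for the reducer. `processBuffer()` and the `.out` naming scheme are placeholders, not part of any Hadoop API, and error handling is omitted.

```cpp
// Rough sketch of the "operate on file names" option as a Streaming mapper:
// each stdin line is an HDFS path; the mapper streams that file via libhdfs,
// writes a processed copy, and emits the output path for the reducer.
#include <cstdio>
#include <fcntl.h>
#include <iostream>
#include <string>
#include <vector>

#include "hdfs.h"   // libhdfs

static void processBuffer(char* /*data*/, tSize /*len*/) {
  // Application-specific transformation of the chunk would go here.
}

int main() {
  hdfsFS fs = hdfsConnect("default", 0);     // fs.defaultFS from the cluster config
  std::string path;
  std::vector<char> buf(4 * 1024 * 1024);    // stream each file in 4 MB pieces

  while (std::getline(std::cin, path)) {
    if (path.empty()) continue;
    std::string outPath = path + ".out";     // placeholder naming scheme

    hdfsFile in  = hdfsOpenFile(fs, path.c_str(),    O_RDONLY, 0, 0, 0);
    hdfsFile out = hdfsOpenFile(fs, outPath.c_str(), O_WRONLY, 0, 0, 0);
    tSize n;
    while ((n = hdfsRead(fs, in, buf.data(),
                         static_cast<tSize>(buf.size()))) > 0) {
      processBuffer(buf.data(), n);
      hdfsWrite(fs, out, buf.data(), n);
    }
    hdfsCloseFile(fs, in);
    hdfsCloseFile(fs, out);

    // Emit "output-path<TAB>1" so the reducer sees every produced file name.
    std::printf("%s\t1\n", outPath.c_str());
  }
  hdfsDisconnect(fs);
  return 0;
}
```

The reducer stage would be the same pattern with a different executable; as noted above, the cost is that none of these reads are data-local.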