Hadoop DistributedCache failed to report status
In a Hadoop job I am mapping several XML files and filtering an ID for every element (from <id> tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines in 2.7 GB, each line with just an integer as an ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader, and save the IDs to a HashSet.
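A minimal sketch of that setup, assuming the old (1.x-era) mapreduce API and that the ID file is the only entry in the local cache; the class name is illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IdFilterMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Set<Integer> ids = new HashSet<Integer>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file was copied to the task node's local disk before
            // the task started; look up its local path via the configuration.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    ids.add(Integer.valueOf(line.trim()));
                }
            } finally {
                reader.close();
            }
        }

        // map() would parse each XML element and emit it only when
        // ids.contains(...) holds for the element's <id> value.
    }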
Now when I start the job, I get countless errors of the form

Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!

before any map task is executed.
The cluster consists of 40 nodes, and since the files of a DistributedCache are copied to the slave nodes before any tasks of the job are executed, I assume the failure is caused by the large HashSet. I have already increased mapred.task.timeout to 2000 seconds. Of course I could raise the time even more, but this period should actually suffice, shouldn't it?
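For reference, mapred.task.timeout is specified in milliseconds, so 2000 seconds corresponds to a value of 2000000; a minimal sketch, assuming the timeout is set programmatically on the job configuration:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // mapred.task.timeout is in milliseconds: 2000 s = 2,000,000 ms.
    conf.setLong("mapred.task.timeout", 2000000L);
    // Equivalent on the command line: -D mapred.task.timeout=2000000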
Since the DistributedCache is meant to be a way to "distribute large, read-only files efficiently", I wonder what causes the failure here, and whether there is another way to pass the relevant IDs to every map task.
1 Answer
Can you add some debug printlns to your setup() method to check whether it is timing out in this method (log the entry and exit times)?
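A minimal sketch of that check; the extra progress() call is one way to keep a long-running setup() alive, since reporting progress resets the status-report timeout:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        System.err.println("setup() entered at " + System.currentTimeMillis());
        // ... parse the cached ID file here, calling context.progress()
        // every few million lines to tell the TaskTracker the task is alive ...
        System.err.println("setup() left at " + System.currentTimeMillis());
    }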
You may also want to look into using a BloomFilter to hold the IDs. You could probably store these values in a 50 MB Bloom filter with a good false-positive rate (~0.5%), and then run a secondary job to perform a partitioned check against the actual reference file.
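A sketch of that idea using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter; the vector size and hash count below are illustrative guesses, not tuned values:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class IdBloomFilter {
        public static void main(String[] args) throws Exception {
            // A 400-million-bit vector occupies roughly 50 MB of memory.
            BloomFilter filter = new BloomFilter(400000000, 4, Hash.MURMUR_HASH);

            // Build the filter once from the reference file of IDs (args[0]).
            BufferedReader reader = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = reader.readLine()) != null) {
                filter.add(new Key(line.trim().getBytes("UTF-8")));
            }
            reader.close();

            // Membership tests can yield false positives but never false
            // negatives, hence the secondary job that verifies hits against
            // the actual reference file.
            System.out.println(filter.membershipTest(new Key("12345".getBytes("UTF-8"))));
        }
    }

Since BloomFilter implements Writable, the built filter could itself be shipped through the DistributedCache instead of the raw 2.7 GB file.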