Hadoop cluster where some nodes are short on disk space
I'm currently running a cluster with 12 nodes. Eight of them have enough disk space, but the other four have very little space left to use.
However, those four nodes still have high RAM and CPU configurations, so my intention is to make use of their resources. But when I run the SlopeOne algorithm, the map phase outputs so much intermediate data that it must be stored on local disk, which leads to the errors I have pasted below this description.
I wonder:
- If a node finds it cannot store the data locally, will it try to store the data on other nodes that have enough disk space?
- If a single node fails to store the data locally, will the work start over again?
- If nodes with enough disk space finish their map tasks first, will they go on to run the tasks that were assigned to the low-disk-space nodes?
- I know I can set a parameter that limits local space usage, so that when a node exceeds the limit the jobtracker stops assigning jobs to it (a sketch of this configuration follows the stack trace below). But wouldn't that just leave the node sitting idle?
- Any suggestions for how I can utilize these resources while keeping the errors away?
Appreciate any ideas.
java.io.IOException: Spill failed
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
at slopeone.SlopeOneTrainer$SlopeOneTrainMapper.map(SlopeOneTrainer.java:71)
at slopeone.SlopeOneTrainer$SlopeOneTrainMapper.map(SlopeOneTrainer.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201104070658_0006/attempt_201104070658_0006_m_000000_0/output/spill897.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
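
For reference, the parameter alluded to in the fourth question is presumably the pair of TaskTracker disk thresholds mapred.local.dir.minspacestart and mapred.local.dir.minspacekill (property names from the Hadoop 0.20-era configuration; the values below are illustrative, not defaults). These are daemon-level settings, so they belong in mapred-site.xml on each node rather than in the job configuration. Turning on mapred.compress.map.output is also a common way to shrink the spill files that are failing here. A minimal sketch:

<!-- mapred-site.xml on each TaskTracker node -->
<property>
  <name>mapred.local.dir.minspacestart</name>
  <!-- illustrative value: stop accepting new tasks when under ~1 GB free -->
  <value>1073741824</value>
</property>
<property>
  <name>mapred.local.dir.minspacekill</name>
  <!-- illustrative value: start killing running tasks when under ~512 MB free -->
  <value>536870912</value>
</property>
<property>
  <!-- compress intermediate map output to reduce spill size -->
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>

Note that the two thresholds only stop the scheduler from sending work to a nearly full node; compressing the map output actually reduces the amount of intermediate data, which is closer to what the last question asks for.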
Comments (1)
You could try reducing the number of replications, as answered in this question: HDFS Reduced Replication Factor.
The default replication factor is 3.
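
A minimal sketch of the two usual ways to do this (the HDFS path below is hypothetical): lower the cluster-wide default in hdfs-site.xml, or change the replication of files already in HDFS with the setrep shell command.

<!-- hdfs-site.xml: new files will be written with 2 replicas instead of 3 -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

# change existing files; -w waits until re-replication completes
hadoop fs -setrep -w 2 /user/hadoop/slopeone/input

Keep in mind this only frees HDFS space; the spill files in the stack trace above live in mapred.local.dir, so a lower replication factor helps only insofar as HDFS and the local directories share the same disks.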