Hadoop cluster with insufficient disk space on some nodes

Published 2024-10-31 04:35:29

I'm currently running a cluster with 12 nodes. Eight of them have enough disk space, but the other four have very little space left to use.

However, those four nodes still have high RAM and CPU configurations, so my intention is to utilize those resources. But when I run the SlopeOne algorithm, the map phase outputs a large amount of intermediate data and stores it on local disk, which leads to the errors I've pasted below this description.
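
(As I understand it, this intermediate data is written not to HDFS but to each node's local mapred.local.dir; a typical MRv1 setting looks like the sketch below, where the paths are placeholders for whatever volumes a node actually has.)

    <!-- mapred-site.xml (MRv1): where map spill files are written; paths are illustrative -->
    <property>
      <name>mapred.local.dir</name>
      <!-- comma-separated list; spreading it across volumes spreads the spill load -->
      <value>/disk1/mapred/local,/disk2/mapred/local</value>
    </property>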

I wonder:

  1. If a node finds it can't store the data locally, will it try to store the data on other nodes that have enough disk space?
  2. If a single node fails to store the data locally, will the work be started over again?
  3. If the nodes with enough disk space finish their map tasks first, will they go on to run the tasks that were assigned to the low-disk-space nodes?
  4. I know I can set a parameter that limits local space usage, and if a node exceeds that limit, the JobTracker won't give it any more tasks (see the config sketch after this list). But will this approach just leave the node sitting there doing no work?
  5. Any suggestions for how I can utilize these resources while keeping the errors away?
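
(For question 4: the parameters I believe are in play are the MRv1 TaskTracker settings mapred.local.dir.minspacestart and mapred.local.dir.minspacekill; the byte values in this sketch are placeholders, not recommendations.)

    <!-- mapred-site.xml on each TaskTracker (MRv1; values are illustrative) -->
    <property>
      <!-- stop accepting new tasks unless at least this much local space is free (bytes) -->
      <name>mapred.local.dir.minspacestart</name>
      <value>1073741824</value>
    </property>
    <property>
      <!-- kill running tasks when free local space drops below this (bytes) -->
      <name>mapred.local.dir.minspacekill</name>
      <value>536870912</value>
    </property>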

Appreciate any ideas.

java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
    at slopeone.SlopeOneTrainer$SlopeOneTrainMapper.map(SlopeOneTrainer.java:71)
    at slopeone.SlopeOneTrainer$SlopeOneTrainMapper.map(SlopeOneTrainer.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201104070658_0006/attempt_201104070658_0006_m_000000_0/output/spill897.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)

Comments (1)

世界等同你 2024-11-07 04:35:29


You could try reducing the replication factor, as answered in this question: HDFS Reduced Replication Factor.
The default replication factor is 3.
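
A minimal sketch of what that could look like: for files already in HDFS, "hadoop fs -setrep -w 2 <path>" lowers their replication in place, and the default for newly written files comes from hdfs-site.xml (the value 2 here is only an example):

    <!-- hdfs-site.xml: default replication factor for newly written files (example value) -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

Freeing HDFS blocks this way should also leave the low-disk nodes a bit more room for local spill files.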
