Hadoop MySQL throttle reducers

Posted 2024-11-07 19:42:47

I'm using Hadoop to update some records in a MySQL db. The issue that I'm seeing is that in certain cases, multiple reducers are launched for the same key set; I've seen up to 2 reducers running on different slaves for the same key. This leads to both reducers updating the same record in the db.

I was thinking of turning off autocommit mode to alleviate this issue and doing the commit as part of the "cleanup" operation in the reducer, but I was wondering what to do with the reducer(s) that lag behind: would the cleanup operation still be called for them? If so, is there a way to tell whether a reducer finished normally or not, since I'd like to call "rollback" on the reducer(s) that didn't finish processing the data entirely?
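A minimal sketch of the commit-in-cleanup idea, assuming the MySQL JDBC driver is available to the tasks; the connection URL, table, and column names are illustrative, not from the question:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: one JDBC connection per reduce task, autocommit off,
    // commit only in cleanup(). Table/column names are hypothetical.
    public class UpdateReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

        private Connection conn;
        private PreparedStatement stmt;

        @Override
        protected void setup(Context context) throws IOException {
            try {
                conn = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass");
                conn.setAutoCommit(false); // hold all updates in one transaction
                stmt = conn.prepareStatement("UPDATE records SET value = ? WHERE id = ?");
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException {
            try {
                for (Text value : values) {
                    stmt.setString(1, value.toString());
                    stmt.setString(2, key.toString());
                    stmt.executeUpdate(); // not visible until commit()
                }
            } catch (SQLException e) {
                throw new IOException(e); // fail the attempt; nothing committed
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            try {
                conn.commit(); // only an attempt that gets this far commits
            } catch (SQLException e) {
                throw new IOException(e);
            } finally {
                try { conn.close(); } catch (SQLException ignored) { }
            }
        }
    }

With autocommit off, an attempt that dies before cleanup() simply drops its connection and MySQL discards the uncommitted transaction (assuming a transactional engine such as InnoDB), so only an attempt that finishes reduce() normally ever commits.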

Comments (3)

腹黑女流氓 2024-11-14 19:42:47

You can set the following MapReduce job property to false:

mapred.reduce.tasks.speculative.execution

(the map-side counterpart is mapred.map.tasks.speculative.execution). This will turn off speculative execution, so Hadoop won't launch a second attempt of the same reduce task on another slave.
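A sketch of setting this when the job is configured, using the pre-YARN property names from this answer; the class and job names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DisableSpeculation {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // One attempt per task: no speculative duplicates on other slaves.
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);
            Job job = new Job(conf, "mysql-update");
            // ... set mapper, reducer, input/output as usual, then:
            // job.waitForCompletion(true);
        }
    }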

烈酒灼喉 2024-11-14 19:42:47

Two things:

  1. I really doubt that two (equal) keys inside a reduce get partitioned to different slaves, since the default HashPartitioner is used. You should override hashCode on your key class, consistently with equals, so equal keys always land in the same partition (a sketch follows this list).
  2. You have the option to set the number of reduce tasks. It can be done with an API call to Job.setNumReduceTasks(X); obviously you can set this to 1.
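A minimal sketch of both points, assuming a key wrapping a single Text field; the class and field names are hypothetical:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical key class: hashCode is consistent with equals, so the
    // default HashPartitioner sends equal keys to the same reduce partition.
    public class RecordKey implements WritableComparable<RecordKey> {
        private final Text id = new Text();

        public void write(DataOutput out) throws IOException { id.write(out); }
        public void readFields(DataInput in) throws IOException { id.readFields(in); }
        public int compareTo(RecordKey other) { return id.compareTo(other.id); }

        @Override
        public boolean equals(Object o) {
            return o instanceof RecordKey && id.equals(((RecordKey) o).id);
        }

        @Override
        public int hashCode() { return id.hashCode(); } // consistent with equals
    }

And, if a single writer is acceptable, serializing all updates through one reducer avoids concurrent updates entirely, at the cost of parallelism:

    job.setNumReduceTasks(1);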

鹿! 2024-11-14 19:42:47

In general (without knowing your use case) it's usually preferable to avoid "side effects" with Hadoop. Writing to a 3rd-party system outside of Hadoop from inside a job can bottleneck your performance and potentially topple that system over, due to the number of concurrent tasks hitting it. I would recommend that you investigate Sqoop from Cloudera to do a batch load after the map-reduce job is complete. I have had good success using it as a bulk loader.

Sqoop Documentation

If you would still like to write to the database directly from Hadoop, you can use the fair scheduler to rate-limit the number of mappers or reducers that can run at any time. Start the job with mapred.queue.name set to your rate-limited queue. You are looking for the maxMaps / maxReduces parameters (a sketch follows the documentation link below).

Fair Scheduler Documentation
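A sketch of submitting to such a queue, assuming a pool named rate_limited has already been given maxMaps / maxReduces caps in the fair scheduler allocation file; the queue and job names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RateLimitedSubmit {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Route the job to a queue whose pool is capped in the fair
            // scheduler allocation file, e.g. <maxReduces>1</maxReduces>.
            conf.set("mapred.queue.name", "rate_limited");
            Job job = new Job(conf, "mysql-update");
            // ... configure mapper/reducer/IO, then job.waitForCompletion(true);
        }
    }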
