Hadoop MySQL limit reducers
I'm using hadoop to update some records in a mysql db...
The issue that I'm seeing is that in certain cases, multiple reducers are launched for the same key set.
I've seen up to 2 reducers running on different slaves for the same key.
This leads to the issue of both reducers updating the same record in the db.
I was thinking of turning off autocommit mode to alleviate this issue, and doing the commit as part of the "cleanup" operation in the reducer. But I was wondering what to do with the reducer(s) that lag behind: would the cleanup operation still be called for them? If so, is there a way to tell whether a reducer finished normally, since I'd like to call "rollback" on the reducer(s) that didn't finish processing the data entirely?
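Roughly the pattern I was considering, as a sketch only; the connection string, table and column names below are made up:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UpdateReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
    private Connection conn;

    @Override
    protected void setup(Context ctx) throws IOException {
        try {
            conn = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass");
            conn.setAutoCommit(false); // nothing hits the db until commit()
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException {
        try {
            PreparedStatement ps =
                conn.prepareStatement("UPDATE records SET val = ? WHERE id = ?");
            for (Text v : values) {
                ps.setString(1, v.toString());
                ps.setString(2, key.toString());
                ps.executeUpdate(); // buffered in the open transaction
            }
            ps.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        try {
            conn.commit(); // the open question: does a lagging attempt ever get here?
            conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}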
3 Answers
You can add the following map-reduce job property with the value false:

mapred.reduce.tasks.speculative.execution

This will turn off speculative execution.
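A sketch of how you might set it when configuring the job (property name from the Hadoop 0.20/1.x era; newer releases call it mapreduce.reduce.speculative, and the job name here is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Only one attempt of each reduce task will run, so a single JVM
// touches any given key's rows in the db.
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
// If your mappers also write to the db, pin them down the same way:
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
Job job = new Job(conf, "mysql-update");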
Two things:

The number of reduce tasks is set via Job.setNumReduceTasks(X). Obviously you can set this to 1.
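For example (a sketch; the rest of the job setup and the import lines are elided, and the job name is made up):

Job job = new Job(new Configuration(), "mysql-update");
job.setNumReduceTasks(1); // every key funnels through a single reducer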
In general (without knowing your use case) it's usually preferable to avoid side effects with Hadoop, i.e. relying on a 3rd-party system outside of Hadoop, as it can bottleneck your performance and potentially topple that system over due to the number of concurrent tasks. I would recommend that you investigate Sqoop from Cloudera to do a batch load after the map-reduce job is complete. I have had good success using it as a bulk loader.
Sqoop Documentation
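For illustration, an export run after the job completes might look like this (the connection string, credentials, table, key column and HDFS path are all placeholders):

sqoop export \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table records \
  --export-dir /user/hadoop/job-output \
  --update-key id \
  --update-mode updateonly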
If you still would like to write directly from Hadoop, you can use the fair scheduler to rate-limit the number of mappers or reducers that can run at any time. Start the job with mapred.queue.name set to your rate-limited queue. You are looking for the maxMaps / maxReduces parameters.
Fair Scheduler Documentation
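For reference, the pool definition in the fair scheduler's allocation file might look like this (pool name and limits are made up):

<?xml version="1.0"?>
<allocations>
  <!-- Cap concurrent tasks from this pool so the db isn't hammered -->
  <pool name="db-limited">
    <maxMaps>2</maxMaps>
    <maxReduces>1</maxReduces>
  </pool>
</allocations>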