Parallel reducing with Hadoop MapReduce
I'm using Hadoop's MapReduce. I have a file as the input to the map function, and the map function does something (not relevant to the question). I'd like my reducer to take the map's output and write to two different files.
I want an efficient solution, and I can think of two approaches:
- 1 reducer that knows how to identify the two different cases and writes to 2 different contexts.
- 2 parallel reducers, each of which knows how to identify its relevant input and ignores the other's, so that each reducer writes to its own file.
I'd prefer the first solution, because it means I go over the map's output only once instead of twice in parallel. If the first isn't supported in some way, I'll be glad to hear a solution for the second.
*Note: The two final files are supposed to stay separate; there is no need to join them at this point.
Comments (2)
The Hadoop API has a feature for creating multiple outputs, called MultipleOutputs, which makes your preferred solution possible.
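A minimal sketch of what that could look like with the `org.apache.hadoop.mapreduce` API. The class name `TwoFileReducer`, the named outputs `first` and `second`, and the `belongsInFirstFile` rule are all hypothetical placeholders, since the question does not say how the two cases are told apart. It needs the Hadoop client libraries on the classpath to compile.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical reducer: one pass over the map output, two named outputs.
public class TwoFileReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> outputs;

    @Override
    protected void setup(Context context) {
        // Wrap the task context so records can be sent to named outputs.
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // "first" and "second" are assumed names; they must match the
            // names registered in the job driver.
            String target = belongsInFirstFile(key, value) ? "first" : "second";
            outputs.write(target, key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing flushes the underlying record writers.
        outputs.close();
    }

    // Placeholder rule (an assumption): replace with the real case detection.
    private boolean belongsInFirstFile(Text key, Text value) {
        return key.toString().startsWith("a");
    }
}
```

In the driver, each named output would have to be registered before the job is submitted, for example `MultipleOutputs.addNamedOutput(job, "first", TextOutputFormat.class, Text.class, Text.class)` and the same for `"second"`. The resulting files should then carry the named output as a prefix in the job's output directory.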
If you know at the map stage which file a record must go to, you can tag your map output with a special key that specifies the target file. For example, if a record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation of file 1, and R1 is the value). If a record R2 must go to file 2, your map output would be <2, R2>.
Then, if you configure the MapReduce job to use exactly 2 reducers, all records tagged <1, _> will be sent to one reducer and all records tagged <2, _> to the other. This holds with the default hash partitioner when the keys are the integers 1 and 2; for other key types, a custom Partitioner makes the routing explicit.
This would be better than your preferred solution, since you still go through your map output only once and, at the same time, the reduce work runs in parallel.
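The routing of tagged records can be checked without a cluster. The sketch below is plain Java, not Hadoop code: `TagPartitionDemo` and `partitionFor` are hypothetical names, and the method only mirrors the formula used by Hadoop's default `HashPartitioner`, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. It assumes integer tags, whose hash code equals their value, as is the case for `IntWritable`.

```java
import java.util.ArrayList;
import java.util.List;

public class TagPartitionDemo {

    // Same arithmetic as Hadoop's default HashPartitioner (re-implemented here
    // for illustration; this is not the Hadoop class itself).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        int[] tags = {1, 2, 1, 2};
        String[] records = {"R1", "R2", "R3", "R4"};

        // One bucket per simulated reducer.
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new ArrayList<>());
        }

        // Route every <tag, record> pair to the reducer chosen by its tag.
        for (int i = 0; i < tags.length; i++) {
            reducers.get(partitionFor(tags[i], numReduceTasks)).add(records[i]);
        }

        System.out.println("reducer 0: " + reducers.get(0)); // reducer 0: [R2, R4]
        System.out.println("reducer 1: " + reducers.get(1)); // reducer 1: [R1, R3]
    }
}
```

With two reduce tasks, tag 1 lands on partition 1 and tag 2 on partition 0, so each reducer sees only one kind of record. If the tags were keys whose hash codes happen to share parity, both would land on the same reducer, which is why an explicit partitioner is the safer choice for anything other than such simple integer tags.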