Disco/MapReduce: using chain_reader on split data
My algorithm currently uses nr_reduces 1 because I need to ensure that the data for a given key is aggregated.
To pass input to the next iteration, one should use "chain_reader". However, the results from a mapper come back as a single result list, which appears to mean that the next map iteration runs as a single mapper! Is there a way to split the results so that multiple mappers are triggered?
I could give a long answer, but since this question is 3 years old: check out this page: http://discoproject.org/doc/disco/howto/dataflow.html#single-partition-map

In short: when there are N inputs to the mapper function, the output will be N, and by setting merge_partitions=False your reduce will output N blobs. Now, if you want to generate more outputs than inputs, you can pass partitions=N. But when your Disco job consists of just a mapper function and you want to generate partitioned output, add the simplest possible reduce phase, combined with the params stated above, to get that partitioned output.
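To see why partitioned output still keeps per-key aggregation intact (the original reason for nr_reduces 1), here is a plain-Python sketch of the hash-partitioning idea that a partitioned reduce relies on. This is an illustration of the concept, not Disco's own code; the partition count `N`, the `partition` helper, and the sample records are all made up for the example.

```python
# Illustrative sketch (plain Python, not Disco itself): hash-partitioning
# mapper output into N partitions keeps every record for a given key in
# the same partition, so per-key aggregation still works without
# collapsing everything into a single reduce.

N = 4  # hypothetical number of partitions

def partition(key, n):
    """Assign a key to one of n partitions (the framework does this internally)."""
    return hash(key) % n

# Hypothetical mapper output: (key, value) pairs.
mapper_output = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 5)]

# Route each record to its partition.
partitions = {i: [] for i in range(N)}
for key, value in mapper_output:
    partitions[partition(key, N)].append((key, value))

# Each non-empty partition can now be reduced (and later re-mapped by a
# separate mapper) independently, because no key is split across partitions.
for pid, records in partitions.items():
    if records:
        print(pid, records)
```

Because all occurrences of a key hash to the same partition, each of the N output blobs can be fed to its own mapper in the next iteration while the per-key aggregation guarantee is preserved.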