Disco/MapReduce: using chain_reader on split data

Posted 2024-08-28 05:25:54


My algorithm currently uses nr_reduces 1 because I need to ensure that the data for a given key is aggregated.

To pass input to the next iteration, one should use "chain_reader". However, the results from a mapper come back as a single result list, which appears to mean that the next map iteration runs as a single mapper! Is there a way to split the results so that multiple mappers are triggered?
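For concreteness, here is a minimal sketch of the setup the question describes, assuming Disco's 0.4-style classic worker API (where the older nr_reduces parameter is spelled partitions, and Job().run accepts map_reader); the tag name and both map bodies are placeholders, and the chain_reader import path varies across Disco versions:

    from disco.core import Job
    from disco.worker.classic.func import chain_reader  # path varies by Disco version

    def map_lines(line, params):
        # Hypothetical first-round mapper: first word is the key.
        key, _, value = line.partition(" ")
        yield key, value

    def map_pairs(pair, params):
        # With map_reader=chain_reader, each entry is a (key, value)
        # pair from the previous job's output.
        key, value = pair
        yield key, value

    def reduce_fun(iter, out, params):
        # partitions=1 (the question's "nr_reduces 1") sends every pair
        # for a given key to this single reduce, so per-key aggregation
        # is safe here.
        for key, value in iter:
            out.add(key, value)

    # First iteration over the raw input (tag name is a placeholder).
    job = Job().run(input=["tag://data:step0"],
                    map=map_lines,
                    reduce=reduce_fun,
                    partitions=1)

    # Next iteration: chain_reader lets the new mappers parse the previous
    # job's internal result format, but a single result list means Disco
    # schedules just one mapper -- which is exactly the problem asked about.
    job2 = Job().run(input=job.wait(),
                     map=map_pairs,
                     map_reader=chain_reader,
                     reduce=reduce_fun,
                     partitions=1)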


Comments (1)

安人多梦 2024-09-04 05:25:54


I could give a long answer, but since this question is 3 years old, check out this page: http://discoproject.org/doc/disco/howto/dataflow.html#single-partition-map

In short: when there are N inputs to the mapper function, the output will be N results, and by setting merge_partitions=False your reduce will output N blobs. If you want to generate more outputs than inputs, you can pass partitions=N. And when your Disco job consists of just a mapper function but you want partitioned output, add the simplest possible reduce phase, combined with the parameters above, to get that partitioned output; a wiring sketch follows the snippet below.

@staticmethod
def reduce(iter, out, params):
    # Identity reduce: pass every (key, value) pair through unchanged.
    # Its only job is to make Disco write partitioned output.
    for (key, value) in iter:
        out.add(key, value)
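To make the suggestion concrete, a sketch of how the pieces might be wired up, again assuming the 0.4-style classic worker API; N, the result URL, and the mapper body are placeholders, and the chain_reader import path varies across Disco versions:

    from disco.core import Job
    from disco.worker.classic.func import chain_reader  # path varies by Disco version

    N = 8  # placeholder: desired partitions, and hence next-round mappers

    def map_pairs(pair, params):
        # Placeholder mapper; with chain_reader each entry is a
        # (key, value) pair from the previous job's output.
        key, value = pair
        yield key, value

    def identity_reduce(iter, out, params):
        # The pass-through reduce from the answer: it does no aggregation
        # and exists only so the job writes partitioned output.
        for key, value in iter:
            out.add(key, value)

    # previous_results would come from the prior iteration, e.g. job.wait();
    # the URL below is just a placeholder.
    previous_results = ["disco://localhost/results/step0"]

    # partitions=N spreads the output across N partitions, and
    # merge_partitions=False keeps them separate, so the next iteration
    # sees N inputs and Disco runs N mappers instead of one.
    job = Job().run(input=previous_results,
                    map=map_pairs,
                    map_reader=chain_reader,
                    reduce=identity_reduce,
                    partitions=N,
                    merge_partitions=False)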