Map-reduce tag counts by date and category

Published 2024-11-17 23:05:02


I am still trying to wrap my brain around map reduce. I have a collection of articles, each of which belongs to one category, and each article has a set of keywords. Assuming that the document looks like this:

{
  author: "kris",
  category: "mongodb",
  content: "...",
  keywords: [ "keyword1", "keyword2", "keyword3" ],
  created_at: "..."
}

I essentially want to pull the keyword counts from all documents, with respect to the author, so I end up with something like:

{
  author: "kris",
  categories: {
    mongodb: { keyword1: 5, keyword2: 3, keyword3: 1 },
    ruby: { ... },
    python: { ... }
  }
}

Any input on this would be greatly appreciated.

Thanks!


Comments (1)

愁杀 2024-11-24 23:05:02


Oh, how thrilled I am by your question! This was actually part of my last assignment for my distributed systems class, so it's quite fresh in my recently graduated mind.

For the parsing details, I'd just google Apache's Hadoop tutorial, but I'll give you the general overview.

Basically, this problem requires two Map-Reduce phases. In the first map, your input should be a list of <filename, {list of keywords}> key-value pairs (you might have to do a little preprocessing on your files, but no biggie). For each of these pairs, you output <keyword, 1> as the pair to be handed to the reducer (you're basically saying every word should be counted once).
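That first map step can be sketched in plain Python (the function name and sample input are illustrative, not Hadoop API):

```python
def map_phase(filename, keywords):
    """First map: for a <filename, {keywords}> pair, emit <keyword, 1>
    once for every keyword occurrence in that document."""
    return [(keyword, 1) for keyword in keywords]

pairs = map_phase("article1.txt", ["keyword1", "keyword2", "keyword1"])
# pairs == [("keyword1", 1), ("keyword2", 1), ("keyword1", 1)]
```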

In the first reduce pass, the previous key-value pairs will conveniently be condensed so that each keyword has its own pair of the form <keyword, {1,1,1,1,1,1}>, with the number of 1s representing the number of times the word appears throughout all of the documents. So you just sum up the 1s and output <keyword, sum>.
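A plain-Python simulation of that shuffle-and-sum step (the grouping a real framework does for you is mimicked here by `shuffle`; the names are my own):

```python
from collections import defaultdict

def shuffle(pairs):
    """Mimic the framework's shuffle: group all values under their key,
    e.g. [("a", 1), ("a", 1)] becomes {"a": [1, 1]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """First reduce: sum the 1s to get <keyword, sum>."""
    return {keyword: sum(ones) for keyword, ones in grouped.items()}

counts = reduce_phase(shuffle([("a", 1), ("b", 1), ("a", 1)]))
# counts == {"a": 2, "b": 1}
```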

The final map/reduce phase just sorts the keywords by their counts. Map: <keyword, sum> --> <sum, keyword>; Reduce: <sum, {keywords}> --> <keyword, sum>. This exploits the fact that map-reduce sorts by key when passing data to the reduce phase.
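Sketched the same way, this sort phase inverts each pair so that ordering by the new key orders keywords by frequency (here Python's `sorted` stands in for the framework's sort-by-key machinery):

```python
def sort_by_count(counts):
    """Invert <keyword, sum> to <sum, keyword>, sort by the sum (the new
    key), then flip back so the output reads <keyword, sum>."""
    by_sum = sorted(((total, keyword) for keyword, total in counts.items()),
                    reverse=True)
    return [(keyword, total) for total, keyword in by_sum]

ranked = sort_by_count({"keyword1": 5, "keyword3": 1, "keyword2": 3})
# ranked == [("keyword1", 5), ("keyword2", 3), ("keyword3", 1)]
```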

Now all of the keywords are next to their word count in sorted order!
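To get the per-author, per-category structure the question asks for, the same counting idea works with a compound key; a minimal in-memory sketch (the sample documents and function name are invented for illustration):

```python
from collections import defaultdict

def keyword_counts_by_author(docs):
    """Count keywords per author and category: conceptually a map step
    emitting <(author, category, keyword), 1>, collapsed into nested dicts."""
    result = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for doc in docs:
        for keyword in doc["keywords"]:
            result[doc["author"]][doc["category"]][keyword] += 1
    return result

docs = [
    {"author": "kris", "category": "mongodb",
     "keywords": ["keyword1", "keyword2"]},
    {"author": "kris", "category": "mongodb",
     "keywords": ["keyword1"]},
]
counts = keyword_counts_by_author(docs)
# dict(counts["kris"]["mongodb"]) == {"keyword1": 2, "keyword2": 1}
```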
