Map-reduce tag counts by date and category
I am still trying to wrap my brain around map reduce. I have a collection of articles, each of which belongs to one category, and each article has a set of keywords. Assuming that the document looks like this:
{
  author: "kris",
  category: "mongodb",
  content: "...",
  keywords: [ "keyword1", "keyword2", "keyword3" ],
  created_at: "..."
}
I essentially want to pull the keyword counts from all documents, grouped by author, so I end up with something like:
{
  author: "kris",
  categories: {
    mongodb: { keyword1: 5, keyword2: 3, keyword3: 1 },
    ruby: { ... },
    python: { ... }
  }
}
Any input on this would be greatly appreciated.
Thanks!
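For reference, the grouping itself (setting map-reduce aside for a moment) is just a nested fold over the documents. A minimal plain-JavaScript sketch, assuming the article fields shown above; `keywordCounts` is a name I made up, not a MongoDB API:

```javascript
// Fold article documents into { author, categories: { category: { keyword: count } } }.
// This is an in-memory sketch of the target shape, not a server-side map-reduce.
function keywordCounts(articles) {
  const byAuthor = {};
  for (const { author, category, keywords } of articles) {
    if (!byAuthor[author]) byAuthor[author] = { author: author, categories: {} };
    const cats = byAuthor[author].categories;
    if (!cats[category]) cats[category] = {};
    for (const kw of keywords) {
      // Each occurrence of a keyword bumps its count within the category.
      cats[category][kw] = (cats[category][kw] || 0) + 1;
    }
  }
  return Object.values(byAuthor);
}
```

In an actual MongoDB map-reduce you would emit the author as the key and merge these nested count objects in the reduce function, but the accumulation logic is the same.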
Oh, how thrilled I am by your question! This was actually part of my last assignment for my distributed systems class, so it's quite fresh in my recently-graduated mind.
For the parsing details, I'd just google Apache's Hadoop tutorial, but I'll give you the general overview.
Basically, this problem requires two Map-Reduce phases. In the first map, your input should be a list of
<filename, {list of keywords}>
key-value pairs (you might have to do a little preprocessing on your files, but no biggie). For each of these pairs, you output
<keyword, 1>
as the pair to be handed to the reducer (you're basically saying every word should be counted once). In the first reduce pass, the previous key-value pairs will conveniently be condensed so that each keyword has its own pair of the form
<keyword, {1,1,1,1,1,1}>
, with the number of 1s representing the number of times the word appears throughout all of the documents. So you just sum up the 1s and output
<keyword, sum>
. The final map/reduce phase is just there to sort the keywords by their count. Map:
<keyword, sum> --> <sum, keyword>
Reduce:
<sum, {keywords}> --> <keyword, sum>
. This exploits the fact that map-reduce sorts by key when passing results to the reduce phase. Now all of the keywords sit next to their word counts, in sorted order!