Calculating rankings in Map/Reduce
I have a simple problem which is hard to solve in SQL and I'm wondering if it can be done in a map-reduce system.
I want to produce rankings. Imagine an Amazon purchase database (much simplified):
ORDERS
ISBN copies_purchased
AAAA 5
AAAA 1
BBBB 3
BBBB 4
CCCC 3
I want to produce the ranking table
rank ISBN copies_purchased
1 BBBB 7
2 AAAA 6
3 CCCC 3
The map-reduce to calculate copies_purchased is obvious; calculating the rankings is less so, at least to me.
(This is not a homework problem. I need this for my actual job. Is that better?)
EDIT
I thought this would have been obvious from the title, and the tags, and the text of the question, but this is not a SQL question. I want to know how to do it in map/reduce. And yes, I have millions of rows. Well, probably billions.
Comments (4)
In CouchDB, map/reduce builds one-dimensional indexes so that Couch can quickly find any information by key.
First, map/reduce builds the copies_purchased view pretty easily, as you say. But the key space is the ISBN, whereas it is the values that you care about, and they are in no particular order. For small applications, people simply fetch the entire data set and sort in memory. That is a great shortcut if you know your requirements, but it does not scale.
A scalable solution is to place these rows into their own database. A second map/reduce can emit keys from copies_purchased and values back to the ISBN. (There is no need for a reduce step.) You can fetch the top N rows, or you can find, e.g., the seventh-ranked book by querying with ?skip=6&limit=1
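To make the second map step concrete, here is a minimal sketch in plain Python that only imitates the idea; it does not talk to CouchDB. The names `second_map`, `index`, and `query` are hypothetical, and the totals are taken from the example in the question. It sorts the index in descending order so that rank 1 comes first; in CouchDB itself that would presumably correspond to also passing `descending=true` on the view query.

```python
from itertools import islice

# Assumed output of the first map/reduce (the copies_purchased view): ISBN -> total copies.
copies_purchased = {"AAAA": 6, "BBBB": 7, "CCCC": 3}

def second_map(isbn, total):
    """Second map step: emit the total as the key and the ISBN as the value."""
    yield (total, isbn)

# A view index is kept sorted by key; sort descending here so rank 1 comes first.
index = sorted(
    (row for isbn, total in copies_purchased.items() for row in second_map(isbn, total)),
    reverse=True,
)

def query(skip=0, limit=None):
    """Imitate skip/limit paging over the sorted index; returns (rank, isbn, total) rows."""
    stop = None if limit is None else skip + limit
    return [
        (skip + i + 1, isbn, total)
        for i, (total, isbn) in enumerate(islice(index, skip, stop))
    ]

top_three = query(limit=3)
second_ranked = query(skip=1, limit=1)
```

Because the rank is just the row's position in the sorted index, no reduce step is needed; `skip` selects the position and `limit` the number of rows.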
If the rank is determined by the number of copies sold, then you can build that table using a SQL SELECT cursor:
And then assign a rank based on the order in which you retrieve the records.
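As one possible illustration of the cursor approach, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout, the `ranked_sales` helper, and the tie-break on ISBN are assumptions made for the example, not part of the original answer.

```python
import sqlite3

def ranked_sales(rows):
    """Load (isbn, copies) rows, total them per ISBN, and rank by cursor order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (isbn TEXT, copies_purchased INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    # The cursor yields rows already sorted by total copies, highest first.
    cursor = conn.execute(
        "SELECT isbn, SUM(copies_purchased) AS total "
        "FROM orders GROUP BY isbn ORDER BY total DESC, isbn"
    )
    # The rank is simply the position at which each row is retrieved.
    ranking = [(rank, isbn, total) for rank, (isbn, total) in enumerate(cursor, start=1)]
    conn.close()
    return ranking

orders = [("AAAA", 5), ("AAAA", 1), ("BBBB", 3), ("BBBB", 4), ("CCCC", 3)]
ranking = ranked_sales(orders)
```

This works well while a single database can sort the grouped result; it does not address the billions-of-rows case raised in the question.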
I'm not sure how you would do this using CouchDB. As far as I know, there is no way to directly read CouchDB data into Hadoop. The closest thing I'm aware of is Brisk, which combines Hadoop and Cassandra. It's also free.
Alternatively, if it did not have to be up to the minute, you could dump the relevant data to text or sequence files, and use these as your input.
I think you would have to do this in a two-step process. First, generate the copies purchased, which is basically the word count example that is so common with Hadoop.
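The first step can be sketched in plain Python as a word-count-style job. This is only a single-process imitation of the map, shuffle, and reduce phases, not Hadoop code; `map_order`, `reduce_copies`, and `run_job` are hypothetical names, and the input format (one "ISBN copies" pair per line) is assumed from the question's example.

```python
from collections import defaultdict

def map_order(line):
    """Map: parse one 'ISBN copies' line and emit (isbn, copies)."""
    isbn, copies = line.split()
    yield isbn, int(copies)

def reduce_copies(isbn, counts):
    """Reduce: sum all copies emitted for one ISBN."""
    return isbn, sum(counts)

def run_job(lines):
    """Imitate the shuffle by grouping map output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for isbn, copies in map_order(line):
            groups[isbn].append(copies)
    return dict(reduce_copies(isbn, counts) for isbn, counts in groups.items())

totals = run_job(["AAAA 5", "AAAA 1", "BBBB 3", "BBBB 4", "CCCC 3"])
```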
Since you can fairly easily find the maximum number of copies purchased by looking at the output of the copies-purchased job (this may be a job in itself), you could then create a custom partitioner that divides the products according to copies purchased. So if you have 3 reducers, and the most you sell is 600 copies, then reducer 1 takes products selling 0 to 200 copies, reducer 2 takes products selling 201 to 400, and reducer 3 takes products selling 401 to 600 copies. Then you can merge the sorted reducer output files, and you have your sorted list of copies sold.
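The range-partitioning idea can be sketched as follows, again as a single-process Python imitation rather than a real Hadoop `Partitioner`. The function names are hypothetical, and equal-width ranges are an assumption; with skewed sales figures a sampled split (as TeraSort does) would balance the reducers better. The sketch sorts each partition in descending order and concatenates from the highest range down, so that the position in the merged list is the rank.

```python
def range_partition(copies, max_copies, num_reducers):
    """Return the reducer index for a product, using equal-width ranges of copies sold."""
    if copies <= 0:
        return 0
    width = -(-max_copies // num_reducers)  # ceiling division
    return min((copies - 1) // width, num_reducers - 1)

def rank_totals(totals, num_reducers):
    """Partition by range, sort each partition, then concatenate into one ranking."""
    max_copies = max(totals.values())
    partitions = [[] for _ in range(num_reducers)]
    for isbn, copies in totals.items():
        partitions[range_partition(copies, max_copies, num_reducers)].append((copies, isbn))
    # Each reducer sorts only its own slice, highest count first.
    for partition in partitions:
        partition.sort(reverse=True)
    # Ranges are disjoint, so concatenating from the highest range down is a full sort.
    merged = [row for partition in reversed(partitions) for row in partition]
    return [(rank, isbn, copies) for rank, (copies, isbn) in enumerate(merged, start=1)]

ranking = rank_totals({"AAAA": 6, "BBBB": 7, "CCCC": 3}, num_reducers=3)
```

With 3 reducers and a maximum of 600, `range_partition` reproduces the split described above: 0 to 200 goes to the first reducer, 201 to 400 to the second, and 401 to 600 to the third.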
Or for source code, check out the TeraSort benchmark code here. More info about the TeraSort classes here.
So you end up with a workflow like:
For help managing a multi-step workflow like this, have a look at Oozie or Cascading.
For more on sorting, see this answer.
It's easy to solve in SQL unless you have millions of rows.
SELECT ISBN, SUM(copies_purchased) AS total_copies FROM orders GROUP BY ISBN ORDER BY 2 DESC;