Calculating rankings in Map/Reduce

Posted on 2024-12-01 22:27:38

I have a simple problem which is hard to solve in SQL and I'm wondering if it can be done in a map-reduce system.

I want to produce rankings. Imagine an Amazon purchase database (much simplified):

ORDERS
ISBN  copies_purchased
AAAA   5
AAAA   1
BBBB   3
BBBB   4
CCCC   3

I want to produce the ranking table

rank ISBN     copies_purchased
1    BBBB     7
2    AAAA     6
3    CCCC     3

The map-reduce to calculate copies_purchased is obvious; calculating the rankings is less so, at least to me.

(This is not a homework problem. I need this for my actual job. Is that better?)

EDIT
I thought this would have been obvious from the title, and the tags, and the text of the question, but this is not a SQL question. I want to know how to do it in map/reduce. And yes, I have millions of rows. Well, probably billions.

幽梦紫曦～ 2024-12-08 22:27:38

In CouchDB, map/reduce builds 1-dimensional indexes so that couch can quickly find any information by key.

First, as you say, map/reduce builds the copies_purchased view pretty easily. But that view is keyed by ISBN, whereas it is the values (the counts) that you care about, and those are in no particular order.

For small applications, people simply fetch the entire data set and sort in-memory. That is a great shortcut if you know your requirements; but it does not scale.

A scalable solution is to place these rows into their own database. A second map/reduce can then emit copies_purchased as the key and the ISBN as the value. (There is no need for a reduce step.)

Key                 Value
copies_purchased    ISBN

7                   BBBB
6                   AAAA
3                   CCCC

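You can fetch the top N rows, or you can find, e.g., the seventh-ranked book by querying with ?skip=6&limit=1 (reading the view in descending key order, e.g. with descending=true, so that the largest counts come first).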
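
As a rough illustration of that second pass, here is a small sketch in plain Python. It only imitates a sorted view in memory and is not CouchDB's API; the names rank_view and query are made up for this example, and the input is assumed to be the output of the first copies_purchased view.

```python
# Assumed output of the first view: total copies_purchased per ISBN.
totals = {"AAAA": 6, "BBBB": 7, "CCCC": 3}


def rank_view(totals):
    """Second pass: emit (copies_purchased, ISBN) rows, kept sorted by key.

    Rows are sorted with the largest count first, which is the order a
    descending read of the view would return. No reduce step is involved.
    """
    rows = [(copies, isbn) for isbn, copies in totals.items()]
    return sorted(rows, reverse=True)


def query(view, skip=0, limit=None):
    """Imitate ?skip=...&limit=... on the sorted rows (hypothetical helper)."""
    end = None if limit is None else skip + limit
    return view[skip:end]


view = rank_view(totals)

# The rank of a row is simply its position in the sorted view.
for rank, (copies, isbn) in enumerate(view, start=1):
    print(rank, isbn, copies)

# The second-ranked book, in the spirit of ?skip=1&limit=1.
second = query(view, skip=1, limit=1)
```

Note that ties are broken here by ISBN, which is an arbitrary choice of this sketch; the answer above does not say how equal counts should be ranked.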

活泼老夫 2024-12-08 22:27:38

If the rank is determined by the number of copies sold, then you can build that table using a SQL SELECT cursor:

select * from ORDERS order by copies_purchased desc

And then assign a rank based on the order in which you retrieve the records:

i = 1; while (nextRecord) currRecord.rank = i++;
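
For concreteness, the cursor approach might look like the sketch below, using Python's built-in sqlite3 module and an in-memory table. The GROUP BY and SUM are an assumption added here because the sample ORDERS table has several rows per ISBN; the total_copies alias and the ranked function are names invented for the example.

```python
import sqlite3

# In-memory stand-in for the ORDERS table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDERS (ISBN TEXT, copies_purchased INTEGER)")
conn.executemany(
    "INSERT INTO ORDERS VALUES (?, ?)",
    [("AAAA", 5), ("AAAA", 1), ("BBBB", 3), ("BBBB", 4), ("CCCC", 3)],
)


def ranked(conn):
    """Walk the ordered cursor and number the rows as they arrive."""
    cursor = conn.execute(
        "SELECT ISBN, SUM(copies_purchased) AS total_copies "
        "FROM ORDERS GROUP BY ISBN "
        "ORDER BY total_copies DESC, ISBN"  # ISBN only breaks ties
    )
    return [
        (rank, isbn, total)
        for rank, (isbn, total) in enumerate(cursor, start=1)
    ]


ranking = ranked(conn)
for row in ranking:
    print(*row)
```

This works on a single database server; it does not address the map/reduce setting the question asks about.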

萌化 2024-12-08 22:27:38

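I'm not sure how you would do this with CouchDB. As far as I know, there is no way to read CouchDB data directly into Hadoop. The closest thing I know of is Brisk, which combines Hadoop and Cassandra. It is also free.

Alternatively, if the data does not have to be completely up to date, you can dump the relevant data to text or sequence files and use those as your input.

I think you will have to do this in two steps. First, generate the copies purchased, which is basically the word count example that is so common in Hadoop.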
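
A minimal sketch of that first step, written in plain Python rather than against the Hadoop API. The functions map_order, shuffle, reduce_copies and count_copies are hypothetical stand-ins for the map phase, the framework's grouping by key, and the reduce phase; the input format (one "ISBN copies" pair per line) is also an assumption.

```python
from collections import defaultdict


def map_order(line):
    """Map: parse one 'ISBN copies' line and emit an (isbn, copies) pair."""
    isbn, copies = line.split()
    yield isbn, int(copies)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_copies(isbn, counts):
    """Reduce: sum all counts seen for one ISBN."""
    return isbn, sum(counts)


def count_copies(lines):
    """Run map, shuffle and reduce in sequence on a small in-memory input."""
    pairs = (pair for line in lines for pair in map_order(line))
    return dict(reduce_copies(k, v) for k, v in shuffle(pairs).items())


lines = ["AAAA 5", "AAAA 1", "BBBB 3", "BBBB 4", "CCCC 3"]
totals = count_copies(lines)
print(totals)
```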

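Since you can find the maximum number of copies purchased fairly easily by looking at the output of the copies-purchased job (this could be a job of its own), you can create a custom partitioner that divides the products by copies purchased. So if you have 3 reducers and your maximum is 600 copies sold, reducer 1 takes products that sold 0 - 200 copies, reducer 2 takes products that sold 201 - 400 copies, and reducer 3 takes products that sold 401 - 600 copies. You can then concatenate the sorted reducer output files in range order, and you have a sorted list of copies sold.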
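
The range-partitioning idea can be sketched as follows, again in plain Python rather than as a Hadoop Partitioner class. The functions partition and ranked are illustrative names, reducers are numbered from 0 here, and the per-bucket sort stands in for the sorting each reducer's input receives.

```python
def partition(copies, max_copies, num_reducers):
    """Pick a reducer index for a sales count; index 0 gets the lowest range.

    With max_copies=600 and 3 reducers this yields the ranges
    0-200, 201-400 and 401-600 from the description above.
    """
    width = -(-max_copies // num_reducers)  # ceiling division
    index = (copies - 1) // width
    return max(0, min(index, num_reducers - 1))


def ranked(totals, num_reducers=3):
    """Partition by range, sort each bucket, then concatenate the buckets."""
    max_copies = max(totals.values())
    buckets = [[] for _ in range(num_reducers)]
    for isbn, copies in totals.items():
        index = partition(copies, max_copies, num_reducers)
        buckets[index].append((copies, isbn))
    # Each "reducer" sorts only its own bucket, largest count first.
    for bucket in buckets:
        bucket.sort(reverse=True)
    # Reading buckets from the highest range down gives a global order.
    merged = [row for bucket in reversed(buckets) for row in bucket]
    return [
        (rank, isbn, copies)
        for rank, (copies, isbn) in enumerate(merged, start=1)
    ]


ranking = ranked({"AAAA": 6, "BBBB": 7, "CCCC": 3})
for row in ranking:
    print(*row)
```

Equal-width ranges are the simplest choice but can leave reducers unevenly loaded when sales are skewed toward low counts; choosing range boundaries from a sample of the data is one way to balance them.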

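Alternatively, for source code, take a look at the TeraSort benchmark code; more information is available on the TeraSort class.

So you end up with a workflow like this: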