Computing Rankings in Map/Reduce

Posted 2024-12-01 22:27:38


I have a simple problem which is hard to solve in SQL and I'm wondering if it can be done in a map-reduce system.

I want to produce rankings. Imagine Amazon purchase database (much simplified)

ORDERS
ISBN  copies_purchased
AAAA   5
AAAA   1
BBBB   3
BBBB   4
CCCC   3

I want to produce the ranking table

rank ISBN     copies_purchased
1    BBBB     7
2    AAAA     6
3    CCCC     3

The map-reduce to calculate copies_purchased is obvious; calculating the rankings is less so, at least to me.
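The "obvious" aggregation phase can be sketched in plain Python, as a toy simulation of the map, shuffle, and reduce steps rather than code for any particular framework:

```python
from itertools import groupby
from operator import itemgetter

# Toy order data: (ISBN, copies_purchased) pairs, as in the example above.
orders = [("AAAA", 5), ("AAAA", 1), ("BBBB", 3), ("BBBB", 4), ("CCCC", 3)]

def map_phase(records):
    # Map: emit (ISBN, copies) for each order line.
    for isbn, copies in records:
        yield isbn, copies

def reduce_phase(mapped):
    # Shuffle/sort by key, then reduce: sum copies per ISBN.
    by_key = sorted(mapped, key=itemgetter(0))
    for isbn, group in groupby(by_key, key=itemgetter(0)):
        yield isbn, sum(copies for _, copies in group)

totals = dict(reduce_phase(map_phase(orders)))
print(totals)  # {'AAAA': 6, 'BBBB': 7, 'CCCC': 3}
```

This gives the per-title totals; the ranking question is then how to order those totals at scale, which is what the answers below address.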

(This is not a homework problem. I need this for my actual job. Is that better?)

EDIT
I thought this would have been obvious from the title, and the tags, and the text of the question, but this is not a SQL question. I want to know how to do it in map/reduce. And yes, I have millions of rows. Well, probably billions.


幽梦紫曦~ 2024-12-08 22:27:38


In CouchDB, map/reduce builds one-dimensional indexes so that Couch can quickly find any information by key.

First, map/reduce builds the copies_purchased view easily enough, as you say. But the key space is the ISBN IDs; it is the values you care about, and they are in no particular order.

For small applications, people simply fetch the entire data set and sort in-memory. That is a great shortcut if you know your requirements; but it does not scale.

A scalable solution is to place these rows into their own database. A second map/reduce can emit copies_purchased as the key and the ISBN as the value. (There is no need for a reduce step.)

Key                 Value
copies_purchased    ISBN

7                   BBBB
6                   AAAA
3                   CCCC

You can fetch the top N rows, or you can find, e.g., the seventh-ranked book by querying with ?skip=6&limit=1.
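The second map step can be simulated in plain Python. (This is a sketch of the idea only; in actual CouchDB the view would be a JavaScript map function calling `emit(doc.copies_purchased, doc.isbn)`, and the field names here are illustrative.)

```python
# Invert the (ISBN -> copies_purchased) view into a
# (copies_purchased -> ISBN) index, then read it in descending key order.
copies_purchased = {"AAAA": 6, "BBBB": 7, "CCCC": 3}

# "Second map": emit copies_purchased as the key and ISBN as the value.
inverted = [(count, isbn) for isbn, count in copies_purchased.items()]

# CouchDB keeps the index sorted by key; descending=true reverses it.
ranking = sorted(inverted, reverse=True)

for rank, (count, isbn) in enumerate(ranking, start=1):
    print(rank, isbn, count)
# 1 BBBB 7
# 2 AAAA 6
# 3 CCCC 3

# ?skip=6&limit=1 would correspond to ranking[6:7] -- the seventh-ranked row.
```

The rank itself never needs to be stored: it falls out of a record's position in the sorted index.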

活泼老夫 2024-12-08 22:27:38


If the rank is determined by the number of copies sold, then you can build that table using a SQL select cursor:

select * from ORDERS order by copies_purchased desc

And then assign a rank based on the order in which records are retrieved:

while (nextRecord) currRecord.rank = i++;
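As a concrete (hypothetical) illustration of this cursor approach, here is a runnable version using Python's sqlite3; note a GROUP BY is added so that ranks are assigned per title, matching the question's expected output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (isbn TEXT, copies_purchased INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("AAAA", 5), ("AAAA", 1), ("BBBB", 3), ("BBBB", 4), ("CCCC", 3)],
)

# Aggregate first, then walk the cursor in descending order, assigning ranks.
cursor = conn.execute(
    "SELECT isbn, SUM(copies_purchased) AS total "
    "FROM orders GROUP BY isbn ORDER BY total DESC"
)

ranking = [(rank, isbn, total)
           for rank, (isbn, total) in enumerate(cursor, start=1)]
print(ranking)  # [(1, 'BBBB', 7), (2, 'AAAA', 6), (3, 'CCCC', 3)]
```

This works on a single machine but, as the question notes, a single sorted cursor over billions of rows is exactly what map/reduce is meant to avoid.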
萌化 2024-12-08 22:27:38


I'm not sure how you would do this using CouchDB. As far as I know, there is no way to read CouchDB data directly into Hadoop. The closest thing I'm aware of is Brisk, which combines Hadoop and Cassandra. It's also free.

Alternatively, if the data does not have to be up to the minute, you could dump the relevant data to text or sequence files and use those as your input.

I think you would have to do this in a two-step process. First, generate the copies purchased, which is basically the word-count example that is so common with Hadoop.

Since you can relatively easily find out the maximum number of copies purchased by looking at the output of the copies-purchased job (this may be a job in itself), you could then create a custom partitioner that divides the products according to copies purchased. So if you have 3 reducers, and the maximum you sell is 600 copies, then reducer 1 takes products selling 0 - 200 copies, reducer 2 takes products selling 201 - 400, and reducer 3 takes products selling 401 - 600 copies. Then you can merge the sorted reducer output files, and you have your sorted list of copies sold.
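The range-partitioning idea can be sketched as a function mapping a copies-sold count to a reducer index. (This is a simplified stand-in for a Hadoop `Partitioner.getPartition` implementation; the function name and parameters here are illustrative.)

```python
def range_partition(copies_sold, max_copies, num_reducers):
    """Assign a record to a reducer by the count range it falls into.

    With max_copies=600 and 3 reducers: 0-200 -> reducer 0,
    201-400 -> reducer 1, 401-600 -> reducer 2.
    """
    width = -(-max_copies // num_reducers)  # ceiling division
    return min((max(copies_sold, 1) - 1) // width, num_reducers - 1)

assert range_partition(150, 600, 3) == 0
assert range_partition(201, 600, 3) == 1
assert range_partition(600, 600, 3) == 2
```

Because each reducer receives a disjoint, contiguous range of counts and sorts its own partition, simply concatenating the reducer outputs in order yields a globally sorted list; this is the same trick TeraSort relies on.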

Or for source code, check out the TeraSort benchmark code here. More info about the TeraSort classes here.

So you end up with a workflow like:

  1. A job to calculate the number of copies sold per product
  2. A job that finds the highest number of copies sold, based on the output of the previous job (though you might be able to skip this step, depending on how you implement the sort)
  3. A job that sorts the data and gives you a sorted list of product copies sold. The output may be spread over multiple files, so you might need a simple script that merges them together.
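The merge in step 3 can be a very small script: since each reducer's part file is already sorted, a lazy heap merge is enough. A minimal sketch, using in-memory lists of (count, ISBN) tuples to stand in for the part-00000, part-00001, ... files:

```python
import heapq

# Each reducer's output is already sorted in descending order.
part_0 = [(7, "BBBB"), (6, "AAAA")]
part_1 = [(3, "CCCC")]

# heapq.merge streams the inputs without loading everything into memory.
merged = list(heapq.merge(part_0, part_1, reverse=True))
print(merged)  # [(7, 'BBBB'), (6, 'AAAA'), (3, 'CCCC')]
```

A real script would read the part files line by line instead of holding lists, but the streaming merge is the same.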

For help managing a multi step workflow like this, have a look at Oozie or Cascading.

For more on sorting see this answer.

二智少女猫性小仙女 2024-12-08 22:27:38


It's easy to solve in SQL unless you have millions of rows.

SELECT ISBN, count(*) FROM orders GROUP BY ISBN ORDER BY 2 desc;
