Hive通过查询获取组中的前n条记录

发布于 2025-01-07 12:55:08 字数 318 浏览 5 评论 0 原文

我在配置单元中有下表

用户 ID、用户名、用户地址、点击次数、展示次数、页面 ID、页面名称

我需要找出前 5 个用户 [用户 ID、用户名、用户地址]通过每个页面 [page-id,page-name] 的点击次数

我知道我们需要首先按 [page-id,page-name] 进行分组,并且在每个组中我想按 [clicks,impressions] desc 进行排序,然后仅发出前 5 名每个页面的 users[user-id, user-name, user-address] 但我发现构建查询很困难。

我们如何使用 HIVe UDF 来做到这一点?

I have following table in hive

user-id, user-name, user-address,clicks,impressions,page-id,page-name

I need to find out top 5 users[user-id,user-name,user-address] by clicks for each page [page-id,page-name]

I understand that we need to first group by [page-id,page-name] and within each group I want to orderby [clicks,impressions] desc and then emit only top 5 users[user-id, user-name, user-address] for each page but I am finding it difficult to construct the query.

How can we do this using HIve UDF ?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

只为守护你 2025-01-14 12:55:08

从 Hive 0.11 开始,您可以使用 Hive 内置的rank()函数并使用更简单的语义来执行此操作 Hive 的内置分析和窗口功能。遗憾的是,我找不到我想要的那么多示例,但它们确实非常有用。使用这些,rank() 和WhereWithRankCond 都是内置的,因此您可以这样做:

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM my table
) ranked_mytable
WHERE ranked_mytable.rank < 5
ORDER BY page-id, rank

不需要UDF,并且只有一个子查询!此外,所有排名逻辑都是本地化的。

您可以在这个 Jira 中找到更多这些函数的示例(尽管还不足以满足我的喜好)< /a> 和此人的博客

As of Hive 0.11, you can do this using Hive's built in rank() function and using simpler semantics using Hive's built-in Analytics and Windowing functions. Sadly, I couldn't find as many examples with these as I would have liked, but they are really, really useful. Using those, both rank() and WhereWithRankCond are built in, so you can just do:

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM my table
) ranked_mytable
WHERE ranked_mytable.rank < 5
ORDER BY page-id, rank

No UDF required, and only one subquery! Also, all of the rank logic is localized.

You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.

怀念你的温柔 2025-01-14 12:55:08

修订答案,修复@Himanshu Gahlot 提到的错误

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(page-id) as rank, clicks FROM (
        SELECT page-id, user-id, clicks FROM mytable
        DISTRIBUTE BY page-id
        SORT BY page-id, clicks desc
) a ) b
WHERE rank < 5
ORDER BY page-id, rank

请注意,rank() UDAF 应用于 page-id 列,其新值用于重置或增加排名计数器(例如,重置每个 page-id 分区的计数器) )

Revised answer, fixing the bug as mentioned by @Himanshu Gahlot

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(page-id) as rank, clicks FROM (
        SELECT page-id, user-id, clicks FROM mytable
        DISTRIBUTE BY page-id
        SORT BY page-id, clicks desc
) a ) b
WHERE rank < 5
ORDER BY page-id, rank

Note that the rank() UDAF is applied to the page-id column, whose new value is used to reset or increase the rank counter (e.g. reset counter for each page-id partition)

鱼窥荷 2025-01-14 12:55:08

您可以使用此处描述的rank() UDF来完成此操作:http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(user-id) as rank, clicks
    FROM mytable
    DISTRIBUTE BY page-id, user-id
    SORT BY page-id, user-id, clicks desc
) a 
WHERE rank < 5
ORDER BY page-id, rank

You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(user-id) as rank, clicks
    FROM mytable
    DISTRIBUTE BY page-id, user-id
    SORT BY page-id, user-id, clicks desc
) a 
WHERE rank < 5
ORDER BY page-id, rank
旧瑾黎汐 2025-01-14 12:55:08

假设您的数据如下所示:

page-id   user-id   clicks
page1     user1     10
page1     user2     10
page1     user3     9
page1     user4     8
page1     user5     7
page1     user6     7
page1     user7     6
page1     user8     5
page2     user1     20
page2     user2     19
page2     user3     18

下面的查询将为您提供:

SELECT page-id, user-id, clicks, rank
FROM (
    SELECT page-id, user-id, rank() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5

结果:

page-id   user-id   clicks  rank
page1     user1     10      1
page1     user2     10      1 
page1     user3     9       3 
page1     user4     8       4
page1     user5     7       5 
page1     user6     7       5 
page2     user1     20      1
page2     user2     19      2  
page2     user3     18      3

因此,对于 page1,您将获得 6 个用户,因为具有相同点击次数的用户排名相同。

但是,如果您正在寻找 5 个用户,并随机选择,以防多个用户处于同一排名。您可以使用以下查询

SELECT page-id, user-id, clicks, rank
FROM (
    SELECT page-id, user-id, row_number() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5

结果:

page-id   user-id   clicks  rank
page1     user1     10      1
page1     user2     10      2 
page1     user3     9       3 
page1     user4     8       4
page1     user5     7       5 
page2     user1     20      1
page2     user2     19      2  
page2     user3     18      3

Let us say your data looks like following :

page-id   user-id   clicks
page1     user1     10
page1     user2     10
page1     user3     9
page1     user4     8
page1     user5     7
page1     user6     7
page1     user7     6
page1     user8     5
page2     user1     20
page2     user2     19
page2     user3     18

Below Query will give you :

SELECT page-id, user-id, clicks, rank
FROM (
    SELECT page-id, user-id, rank() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5

Result :

page-id   user-id   clicks  rank
page1     user1     10      1
page1     user2     10      1 
page1     user3     9       3 
page1     user4     8       4
page1     user5     7       5 
page1     user6     7       5 
page2     user1     20      1
page2     user2     19      2  
page2     user3     18      3

So, for page1 you are getting 6 users, as users with same number of clicks are ranked same.

But, if you are looking for exactly 5 users, and pick randomly in case multiple users fall in same rank. You can use the below query

SELECT page-id, user-id, clicks, rank
FROM (
    SELECT page-id, user-id, row_number() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5

Result :

page-id   user-id   clicks  rank
page1     user1     10      1
page1     user2     10      2 
page1     user3     9       3 
page1     user4     8       4
page1     user5     7       5 
page2     user1     20      1
page2     user2     19      2  
page2     user3     18      3
指尖上的星空 2025-01-14 12:55:08

您可以使用 hivemall 的 >each_top_k 功能 Apache Hive 上的高效 top-k 计算。

select
  page-id, 
  user-id,
  clicks
from (
  select
    each_top_k(5, page-id, clicks, page-id, user-id)
      as (rank, clicks, page-id, user-id)
  from (
    select
      page-id, user-id, clicks
    from
      mytable
    DISTRIBUTE BY page-id SORT BY page-id
  ) t1
) t2
order by page-id ASC, clicks DESC

与在 Hive 中运行 top-k 查询(例如,distributed by/rank)的其他方法相比,each_top_k UDTF 非常快,因为它不保存中间结果。

You can use each_top_k function of hivemall for an efficient top-k computation on Apache Hive.

select
  page-id, 
  user-id,
  clicks
from (
  select
    each_top_k(5, page-id, clicks, page-id, user-id)
      as (rank, clicks, page-id, user-id)
  from (
    select
      page-id, user-id, clicks
    from
      mytable
    DISTRIBUTE BY page-id SORT BY page-id
  ) t1
) t2
order by page-id ASC, clicks DESC

The each_top_k UDTF is very fast when compared to other methods running top-k queries (e.g., distributed by/rank) in Hive because it does not hold the whole ranking for the intermediate result.

做个少女永远怀春 2025-01-14 12:55:08

select * from (select user_id,user-name,user-address,page,click,row_num() over (partition by page order by clicks desc) a where a.row_num<=5

在选择列时可能会发生变化,但是逻辑是正确的。

select * from (select user_id,user-name,user-address,page,click,row_num() over (partition by page order by clicks desc) a where a.row_num<=5

there might be a change in choosing the columns but the logic is correct.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文