Hive通过查询获取组中的前n条记录
我在配置单元中有下表
用户 ID、用户名、用户地址、点击次数、展示次数、页面 ID、页面名称
我需要找出前 5 个用户 [用户 ID、用户名、用户地址]通过每个页面 [page-id,page-name] 的点击次数
我知道我们需要首先按 [page-id,page-name] 进行分组,并且在每个组中我想按 [clicks,impressions] desc 进行排序,然后仅发出前 5 名每个页面的 users[user-id, user-name, user-address] 但我发现构建查询很困难。
我们如何使用 HIVe UDF 来做到这一点?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
从 Hive 0.11 开始,您可以使用 Hive 内置的rank()函数并使用更简单的语义来执行此操作 Hive 的内置分析和窗口功能。遗憾的是,我找不到我想要的那么多示例,但它们确实非常有用。使用这些,rank() 和WhereWithRankCond 都是内置的,因此您可以这样做:
不需要UDF,并且只有一个子查询!此外,所有排名逻辑都是本地化的。
您可以在这个 Jira 中找到更多这些函数的示例(尽管还不足以满足我的喜好)< /a> 和此人的博客。
As of Hive 0.11, you can do this using Hive's built in rank() function and using simpler semantics using Hive's built-in Analytics and Windowing functions. Sadly, I couldn't find as many examples with these as I would have liked, but they are really, really useful. Using those, both rank() and WhereWithRankCond are built in, so you can just do:
No UDF required, and only one subquery! Also, all of the rank logic is localized.
You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.
修订答案,修复@Himanshu Gahlot 提到的错误
请注意,rank() UDAF 应用于 page-id 列,其新值用于重置或增加排名计数器(例如,重置每个 page-id 分区的计数器) )
Revised answer, fixing the bug as mentioned by @Himanshu Gahlot
Note that the rank() UDAF is applied to the page-id column, whose new value is used to reset or increase the rank counter (e.g. reset counter for each page-id partition)
您可以使用此处描述的rank() UDF来完成此操作:http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
假设您的数据如下所示:
下面的查询将为您提供:
结果:
因此,对于 page1,您将获得 6 个用户,因为具有相同点击次数的用户排名相同。
但是,如果您正在寻找 5 个用户,并随机选择,以防多个用户处于同一排名。您可以使用以下查询
结果:
Let us say your data looks like following :
Below Query will give you :
Result :
So, for page1 you are getting 6 users, as users with same number of clicks are ranked same.
But, if you are looking for exactly 5 users, and pick randomly in case multiple users fall in same rank. You can use the below query
Result :
您可以使用
hivemall
的 >each_top_k
功能 Apache Hive 上的高效 top-k 计算。与在 Hive 中运行 top-k 查询(例如,
distributed by/rank
)的其他方法相比,each_top_k
UDTF 非常快,因为它不保存中间结果。You can use
each_top_k
function ofhivemall
for an efficient top-k computation on Apache Hive.The
each_top_k
UDTF is very fast when compared to other methods running top-k queries (e.g.,distributed by/rank
) in Hive because it does not hold the whole ranking for the intermediate result.select * from (select user_id,user-name,user-address,page,click,row_num() over (partition by page order by clicks desc) a where a.row_num<=5
在选择列时可能会发生变化,但是逻辑是正确的。
select * from (select user_id,user-name,user-address,page,click,row_num() over (partition by page order by clicks desc) a where a.row_num<=5
there might be a change in choosing the columns but the logic is correct.