How can I efficiently sort (order by) big data with Hive?
I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described here: How does the MapReduce sort algorithm work?), but I want to do it with Hive.
However, the Hive manual states that "order by" is performed by a single reducer.
This surprises me, as Pig does implement something similar to the article - pig impl
Am I missing something, or is Hive simply not the right hammer for this job?
Comments (3)
I think that Hive is not the right tool for the job, at least for now. It was built as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good TOP N capability but no good total ordering.
In case you haven't encountered it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset in the best possible way using MR. http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
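To make the idea behind a terasort-style total order concrete, here is a minimal single-process sketch in Python. It is not Hadoop's or Pig's actual code: the function names (`choose_split_points`, `partition`, `total_order_sort`) and the sample size are assumptions made for illustration. The idea is to sample the keys, pick cut points so that each "reducer" gets a contiguous key range, let each reducer sort its own range, and then read the reducer outputs in order.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 cut points from a sample of the keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    # Integer arithmetic keeps every index strictly below len(ordered).
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys, num_reducers, sample_size=100, seed=0):
    """Simulate a range-partitioned sort across several reducers."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    split_points = choose_split_points(sample, num_reducers)

    # "Shuffle": route every key to the reducer that owns its range.
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key, split_points)].append(key)

    # Each reducer sorts only its own range, independently of the others.
    sorted_buckets = [sorted(bucket) for bucket in buckets]

    # Concatenating the reducer outputs in reducer order is globally sorted.
    merged = [key for bucket in sorted_buckets for key in bucket]
    return merged, sorted_buckets
```

Because the ranges do not overlap, no final merge step is needed; how evenly the work is spread depends on how representative the sample is.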
It is not possible to use multiple reducers for total ordering in Hive. It has not been implemented yet: https://issues.apache.org/jira/browse/HIVE-1402
If you want efficient total ordering, it will be easier to use Pig than to write a custom MR job.
Hive generates MapReduce job(s) to execute queries. In your particular case, the actual sorting is done by the Hadoop MapReduce framework before the data is fed into the reducer.
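As a rough illustration of that point, the following Python sketch models the single-reducer case: each map task's output is sorted by key, and the sorted runs are merged into one ordered stream before the reducer sees them. This is a simplified model, not Hive or Hadoop code, and the names `map_side_sort` and `single_reducer_order_by` are hypothetical.

```python
import heapq


def map_side_sort(split):
    """Model one map task: emit its keys as a sorted run."""
    return sorted(split)


def single_reducer_order_by(splits):
    """Merge the sorted runs of all map tasks into one ordered stream."""
    sorted_runs = [map_side_sort(split) for split in splits]
    # The lone reducer receives the keys already in global order.
    return list(heapq.merge(*sorted_runs))
```

The merge is cheap per record, but every record still passes through one reducer, which is why a single-reducer "order by" becomes the bottleneck on large inputs.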