Logically, the cardinality of the columns should matter when performing a GROUP BY operation. When we write Hive queries involving GROUP BY, we are familiar with the data being queried, so we have an idea of the cardinality of each column involved in the GROUP BY. But Hive has no idea about this. So let's say the Hive query in question is:
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col1,Col2,Col3,Col4,Col5
I know the cardinality of all 5 columns here. But Hive doesn't know that, so it will probably have to assume the worst case.
Say the cardinality of these columns is as follows, from lowest to highest, with an example of the values each one contains:
- Col5 = country name
- Col4 = state name
- Col3 = city name
- Col2 = postal code
- Col1 = email address
As things stand, Hive will treat all of these columns the same. Wouldn't it be beneficial if Hive knew the underlying cardinality, so that it could exploit it when calculating the unique groups? And if so, would it be more efficient if I explicitly arranged the columns in the GROUP BY clause in order of cardinality, as in the following example?
SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY Col5,Col4,Col3,Col2,Col1
Or will Hive ignore this order and treat all the columns equally, regardless of how they are listed?
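To be clear, I am only asking about performance, since both queries should return the same groups. Here is a minimal sketch of that point. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Hive, and the sample rows are made up for illustration, so it says nothing about how Hive actually executes either query:

```python
import sqlite3

# Hypothetical sample rows (Col1=email, Col2=postal code, Col3=city,
# Col4=state, Col5=country). SQLite is used here only to compare results;
# it is not Hive and tells us nothing about Hive's execution plan.
rows = [
    ("a@example.com", "10001", "New York", "New York", "USA"),
    ("a@example.com", "10001", "New York", "New York", "USA"),
    ("b@example.com", "94105", "San Francisco", "California", "USA"),
    ("c@example.com", "M5H", "Toronto", "Ontario", "Canada"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (Col1 TEXT, Col2 TEXT, Col3 TEXT, Col4 TEXT, Col5 TEXT)"
)
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?, ?, ?)", rows)

select = "SELECT Col1,Col2,Col3,Col4,Col5,COUNT(*) FROM MyTable GROUP BY "

# Same query, two different orderings of the GROUP BY columns.
# Sorting removes any difference in the order the groups are returned.
original_order = sorted(conn.execute(select + "Col1,Col2,Col3,Col4,Col5").fetchall())
by_cardinality = sorted(conn.execute(select + "Col5,Col4,Col3,Col2,Col1").fetchall())

print(original_order == by_cardinality)  # prints True
```

So the column order cannot change the answer. The open question is whether it changes how much work Hive does to get there.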