Hive SQL aggregation
I have two tables in Hive, `t1` and `t2`.
> describe t1;
date_id string
> describe t2;
messageid string,
createddate string,
userid int
> select * from t1 limit 3;
2011-01-01 00:00:00
2011-01-02 00:00:00
2011-01-03 00:00:00
> select * from t2 limit 3;
87211389 2011-01-03 23:57:01 13864753
87211656 2011-01-03 23:57:59 13864769
87211746 2011-01-03 23:58:25 13864785
What I want is to count the distinct `userid` values over the three days up to and including a given date.
For example, for date `2011-01-03`, I want to count distinct `userid` from `2011-01-01` to `2011-01-03`.
For date `2011-01-04`, I want to count distinct `userid` from `2011-01-02` to `2011-01-04`.
I wrote the following query, but it does not return the three-day result. It returns distinct `userid` per day instead.
SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2
ON (to_date(t2.createddate) = to_date(t1.date_id))
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP by to_date(t1.date_id);
`to_date()` and `date_sub()` are date functions in Hive.
In other words, the following part does not take effect.
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
EDIT: One solution can be (but it is super slow):
SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
SELECT * FROM t1 LEFT OUTER JOIN t2
WHERE
(date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
AND to_date(t2.createddate) <= to_date(t1.date_id)
)
) t3
GROUP by to_date(t3.date_id);
UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL. Unfortunately, they cannot be used in Hive.
My current solution is to use `UNION ALL`.
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
UNION ALL
SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
Then I do a `group by` and a `count`. In this way, I can get what I want.
Although it is not elegant, it is much more efficient than a cross join.
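Put together, the three branches plus the final aggregation could look like the sketch below. It assumes each branch projects only the date and `userid` instead of `SELECT *`; the aliases `u` and `distinct_users` are illustrative names, not from the original post.

```sql
-- Sketch: each branch pairs a t1 date with t2 rows created 0, 1, or 2 days earlier.
SELECT u.date_id,
       COUNT(DISTINCT u.userid) AS distinct_users
FROM (
    -- same day
    SELECT to_date(t1.date_id) AS date_id, t2.userid
    FROM t1 JOIN t2
      ON (to_date(t1.date_id) = to_date(t2.createddate))
    UNION ALL
    -- one day earlier
    SELECT to_date(t1.date_id) AS date_id, t2.userid
    FROM t1 JOIN t2
      ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
    UNION ALL
    -- two days earlier
    SELECT to_date(t1.date_id) AS date_id, t2.userid
    FROM t1 JOIN t2
      ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
) u
GROUP BY u.date_id;
```

Every branch is an equality join, which is why this form runs where a range condition in `ON` is rejected.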
Comments (3)
The following should work in standard SQL...
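A minimal sketch of such a query, assuming the date-range condition is moved from `WHERE` into the `ON` clause. It keeps the question's `to_date()` and `date_sub()` calls, which are Hive functions rather than standard SQL, and the alias `distinct_users` is illustrative. Hive versions that accept only equality conditions in `ON` will reject it.

```sql
-- Sketch: range join covering the given date and the two days before it.
SELECT t1.date_id,
       COUNT(DISTINCT t2.userid) AS distinct_users
FROM t1
LEFT OUTER JOIN t2
  ON  to_date(t2.createddate) >  date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
GROUP BY t1.date_id;
```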
It will, however, be slow. Because you are storing dates as strings and then using `to_date()` to convert them to dates, indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.
As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.
If you store these values as dates, you don't need `to_date()`. And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
NOTE: The format of the `ON` clause means that you do not need to round `t2.createddate` down to a daily value.
EDIT Why your code didn't work...
This joins t1 to t2 with an `ON` clause of `(to_date(t2.createddate) = to_date(t1.date_id))`. As the join is a LEFT OUTER JOIN, the values in `t2.createddate` MUST now either be NULL (no matches) or be the same as `t1.date_id`.
The `WHERE` clause allows a much wider range (3 days). But the `ON` clause of the `JOIN` has already restricted your data down to a single day.
The example I gave above simply takes your `WHERE` clause and puts it in place of the old `ON` clause.
EDIT
Hive doesn't allow `<=` and `>=` in the ON clause? Are you really locked in to using Hive? If you really are, what about BETWEEN?
Alternatively, refactor your table of dates to enumerate the dates you want to include...
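One way to read this suggestion is sketched below: build a helper table with one row per (reporting date, included date) pair, so that the join back to `t2` becomes a plain equality join. The table name `t1_window` and the aliases `o`, `days_back`, `included_date`, and `distinct_users` are hypothetical, introduced only for illustration.

```sql
-- Hypothetical helper table: each t1 date is expanded into the 3 dates it should cover.
CREATE TABLE t1_window AS
SELECT to_date(t1.date_id)                        AS date_id,
       date_sub(to_date(t1.date_id), o.days_back) AS included_date
FROM t1
LATERAL VIEW explode(array(0, 1, 2)) o AS days_back;

-- Equality join against the enumerated dates, then aggregate per reporting date.
-- LEFT OUTER JOIN keeps dates with no messages (they count as 0).
SELECT w.date_id,
       COUNT(DISTINCT t2.userid) AS distinct_users
FROM t1_window w
LEFT OUTER JOIN t2
  ON (w.included_date = to_date(t2.createddate))
GROUP BY w.date_id;
```

Changing the window length then only means changing the array passed to `explode()`.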
You need a subquery.
Try something like this (I cannot test it because I don't have Hive):
The key is that with a subquery, FOR EACH date in t1 the right records are selected in t2.
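One way to picture the per-date subquery is a correlated scalar subquery in the select list, sketched below in generic SQL with the question's date functions. This is an illustration of the idea rather than the answer's original query, and whether a given Hive version accepts a correlated subquery in this position is an assumption to verify. The alias `distinct_users` is illustrative.

```sql
-- Sketch: for each t1 date, count distinct users in t2 over that date and the two days before.
SELECT t1.date_id,
       (SELECT COUNT(DISTINCT t2.userid)
        FROM t2
        WHERE to_date(t2.createddate) >  date_sub(to_date(t1.date_id), 3)
          AND to_date(t2.createddate) <= to_date(t1.date_id)) AS distinct_users
FROM t1;
```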
EDIT:
To force the subquery into the FROM clause, you could try this:
But I don't know if it would work.
I am assuming that t1 is used to define the 3-day period. I suspect the puzzling approach is due to Hive's shortcomings.
This allows you to have an arbitrary number of 3-day periods.
Try the following 2 queries
--or--
The latter minimizes the function calls on the t2 table. I am also assuming that t1 is the smaller of the two tables.
`substring` should return the same result as `to_date`. According to the documentation, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions, `to_date` returns a string data type.
Support for date data types seems minimal, but I am not familiar with Hive.
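A quick way to check the `substring` claim against real data is to select both expressions side by side. The sketch below assumes `createddate` always has the `yyyy-MM-dd HH:mm:ss` layout shown in the question; the column aliases are illustrative.

```sql
-- Sketch: compare the date part produced by to_date() with a plain substring.
SELECT createddate,
       to_date(createddate)          AS via_to_date,
       substring(createddate, 1, 10) AS via_substring
FROM t2
LIMIT 3;
```

If the two columns match on a representative sample, the cheaper string operation can stand in for `to_date()` in the join condition.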