Hive SQL aggregation

Posted on 2024-12-07 19:27:00

I have two tables in Hive, t1 and t2:

>describe t1;
>date_id    string

>describe t2;
>messageid string,
 createddate string,
 userid int

> select * from t1 limit 3;        
> 2011-01-01 00:00:00 
  2011-01-02 00:00:00 
  2011-01-03 00:00:00 

> select * from t2 limit 3;
87211389    2011-01-03 23:57:01 13864753
87211656    2011-01-03 23:57:59 13864769
87211746    2011-01-03 23:58:25 13864785

For a given date, I want to count the distinct userids over the three days ending on that date.
For example, for date 2011-01-03, I want to count the distinct userids from 2011-01-01 to 2011-01-03.
For date 2011-01-04, I want to count the distinct userids from 2011-01-02 to 2011-01-04.

I wrote the following query, but it does not return the three-day result. Instead, it returns the distinct userids per day.

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

`to_date()` and `date_sub()` are date functions in Hive.

In other words, the following part has no effect.

WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id)

EDIT: One solution can be (but it is super slow):

SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
 SELECT * FROM t1  LEFT OUTER JOIN t2
 WHERE 
 (date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
 )
) t3 
GROUP by to_date(t3.date_id);

UPDATE: Thanks for all the answers. They are good.
But Hive is a bit different from SQL, and unfortunately they cannot be used in Hive.
My current solution is to use UNION ALL.

 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
 UNION ALL 
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))

Then I do the group by and count. This way, I get what I want.
Although it is not elegant, it is much more efficient than a cross join.
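
The "group by and count" step is not shown above. A minimal sketch of how it might look, assuming the three joins are wrapped in a subquery (some Hive versions only accept UNION ALL inside one). The aliases `u`, `day_id`, and `users_3d` are made up for illustration and are not from the original post, and the query has not been run against a real cluster.

```sql
-- Sketch only: each branch is an equi-join, which Hive accepts in ON.
-- A row of t2 created on day D is attached to t1 dates D, D+1 and D+2,
-- so each t1 date collects the users of its own day and the two days before.
SELECT u.day_id,
       count(DISTINCT u.userid) AS users_3d   -- hypothetical output name
FROM (
  SELECT to_date(t1.date_id) AS day_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
  UNION ALL
  SELECT to_date(t1.date_id) AS day_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
  UNION ALL
  SELECT to_date(t1.date_id) AS day_id, t2.userid
  FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))
) u
GROUP BY u.day_id;
```

Selecting only the two needed columns in each branch, rather than `SELECT *`, keeps the rows passed to the final aggregation small.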

把时间冻结 2024-12-14 19:27:00

The following should work in standard SQL...

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON  to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
    AND to_date(t2.createddate) <  date_add(to_date(t1.date_id), 1)
GROUP BY
  to_date(t1.date_id)

It will, however, be slow, because you are storing dates as strings and then using to_date() to convert them to dates. This means that indexes can't be used, and the SQL engine can't do anything clever to reduce the effort being expended.

As a result, every possible combination of rows needs to be compared. If you have 100 entries in T1 and 10,000 entries in T2, your SQL engine is processing a million combinations.

If you store these values as dates, you don't need to_date(). And if you index the dates, the SQL engine can quickly home in on the range of dates being specified.
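
One way this advice might be applied in Hive is to convert the timestamps once, up front, instead of inside every join. The sketch below is an assumption-laden illustration: the table name `t2_daily` and the column `created_day` are hypothetical, and whether an index is available or useful depends on the engine and version, so only the "convert once" part is shown.

```sql
-- Sketch only: materialize one row per (day, user) so later joins
-- compare precomputed day values instead of calling to_date() on t2.
CREATE TABLE t2_daily AS
SELECT DISTINCT
       to_date(createddate) AS created_day,  -- hypothetical column name
       userid
FROM t2;
```

Because a user who posts many messages on one day collapses to a single row, the table that later joins have to scan can be much smaller than t2 itself.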

NOTE: The format of the ON clause means that you do not need to round t2.createddate down to a daily value.

EDIT: Why your code didn't work...

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). Because the join is a plain (inner) JOIN, every row that survives it has a t2.createddate on the same day as t1.date_id; with a LEFT OUTER JOIN, the t2 columns could additionally be NULL when nothing matches.

The WHERE clause allows a much wider range (3 days), but the ON clause of the JOIN has already restricted your data down to a single day.

The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.

EDIT

Hive doesn't allow <= and >= in the ON clause? Are you really locked into using Hive?

If you really are, what about BETWEEN?

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
  to_date(t1.date_id)

Alternatively, refactor your table of dates to enumerate the dates you want to include...

TABLE t1 (calendar_date, inclusive_date) =
{ 2011-01-03, 2011-01-01
  2011-01-03, 2011-01-02
  2011-01-03, 2011-01-03

  2011-01-04, 2011-01-02
  2011-01-04, 2011-01-03
  2011-01-04, 2011-01-04

  2011-01-05, 2011-01-03
  2011-01-05, 2011-01-04
  2011-01-05, 2011-01-05 }

SELECT
  to_date(t1.calendar_date),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
  to_date(t1.calendar_date)
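
The enumerated table does not have to be typed in by hand. A sketch of how it might be derived from the existing t1, assuming a Hive version that supports LATERAL VIEW with explode(); the table name `t1_window` and the aliases `o` and `n` are hypothetical, and the statement is untested.

```sql
-- Sketch only: emit three rows per calendar date, one for each day
-- in its window (offsets 0, 1 and 2 days back).
CREATE TABLE t1_window AS
SELECT to_date(t1.date_id)                  AS calendar_date,
       date_sub(to_date(t1.date_id), o.n)   AS inclusive_date
FROM t1
LATERAL VIEW explode(array(0, 1, 2)) o AS n;
```

Joining t2 to `t1_window` on `inclusive_date` then needs only an equality condition, which is the kind of join condition the asker reports Hive accepting.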


尐偏执 2024-12-14 19:27:00

You need a subquery:

Try something like this (I cannot test it because I don't have Hive):

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE t2.messageid in 
    (
    select t2.messageid from t2 where 
    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
   )
GROUP by to_date(t1.date_id);

The key is that the subquery selects the right records in t2 for each date in t1.

EDIT:

To force the subquery into the FROM clause, you could try this:

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN 

(select userid, createddate  from t2 where 

    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
) as t2

ON (to_date(t2.createddate) = to_date(t1.date_id))  

GROUP by to_date(t1.date_id);

But I don't know whether it will work.


国粹 2024-12-14 19:27:00

I am assuming that t1 is used to define the 3-day periods. I suspect this puzzling approach is due to Hive's shortcomings.
It allows you to have an arbitrary number of 3-day periods.
Try the following two queries:

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON substring(t2.createddate,1,10) >= date_sub(substring(t1.date_id,1,10), 2) 
AND substring(t2.createddate,1,10) <=  substring(t1.date_id,1,10) 
GROUP BY t1.date_id

--or--

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON t2.createddate like concat(substring(t1.date_id, 1, 10), '%') 
OR t2.createddate like concat(substring(date_sub(t1.date_id, 1), 1, 10), '%') 
OR t2.createddate like concat(substring(date_sub(t1.date_id, 2), 1, 10), '%') 
GROUP BY t1.date_id

The latter minimizes the function calls on the t2 table. I am also assuming that t1 is the smaller of the two.
substring should return the same result as to_date. According to the documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions), to_date returns a string data type.
Support for date data types seems minimal, but I am not familiar with Hive.
