Partition pruning when using analytic functions on a DATETIME field with DAY granularity, as part of a view
I am trying to use analytic functions (e.g. FIRST_VALUE) while still benefiting from partition pruning, on a table that is partitioned on a DATETIME field with DAY granularity.
Example Data
Let's assume a table with the following columns:
name | type |
---|---|
dt | DATETIME |
value | STRING |
The table is partitioned on dt with DAY granularity.
An example table can be created using the following SQL:
CREATE TABLE `project.dataset.example_partioned_table`
PARTITION BY DATE(dt)
AS
SELECT dt, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
The generated table will be over 1 GB (around 3.4 MB per day).
The Problem
Now I want to get the first value in each partition. (Later I actually want a further breakdown.)
Since I want to use a view, the view itself wouldn't know the final date range. In the example query I am using a temporary table t_view in place of the view.
WITH t_view AS (
SELECT
dt,
value,
FIRST_VALUE(value) OVER(
PARTITION BY DATE(dt)
ORDER BY dt
) AS first_val
FROM `project.dataset.example_partioned_table`
)
SELECT *,
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')
The query will contain something like some value: 2020-01-01 00:00:00 for first_val (i.e. the first value for the date).
However, as it stands, it is scanning the whole table (over 1 GB), when it should just scan the partition.
Other observations
If I don't include first_val (the analytic function) in the result, then the partition pruning works as intended. Including first_val causes it to scan everything.
If I don't wrap dt with DATE, then the partition pruning also works, but it would of course not provide the correct value.
I also tried DATETIME_TRUNC(request.timestamp, DAY), with the same lack of partition pruning as DATE(request.timestamp).
Adding the date WHERE clause inside the temporary table also works, but I wouldn't know the date range inside the view.
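For illustration, a sketch of that working-but-unusable variant: pushing the date filter inside the temporary table does prune, but only because the range is hard-coded there, which defeats the purpose of the view:

```sql
WITH t_view AS (
  SELECT
    dt,
    value,
    FIRST_VALUE(value) OVER (
      PARTITION BY DATE(dt)
      ORDER BY dt
    ) AS first_val
  FROM `project.dataset.example_partioned_table`
  -- Filter inside the CTE: pruning works here, but the view
  -- would need to know this date range up front.
  WHERE DATE(dt) = DATE('2020-01-01')
)
SELECT *
FROM t_view
```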
How can I restrict the analytical function to the partition of the row?
Failed workaround using GROUP BY
Related, I also tried a workaround using GROUP BY (the date), with the same result...
WITH t_view_1 AS (
SELECT
dt,
DATE(dt) AS _date,
value,
FROM `project.dataset.example_partioned_table`
),
t_view_2 AS (
SELECT
_date,
MIN(value) AS first_val
FROM t_view_1
GROUP BY _date
),
t_view AS (
SELECT
t_view_1._date,
t_view_1.dt,
t_view_2.first_val
FROM t_view_1
JOIN t_view_2
ON t_view_2._date = t_view_1._date
)
SELECT *,
FROM t_view
WHERE _date = '2020-01-01'
As before, it is scanning the whole table rather than only processing the partition with the selected date.
Potentially working workaround with partition on DATE field
If the table is instead partitioned on a DATE field (_date), e.g.:
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
Then the partition pruning works with the following adjusted example query:
WITH t_view AS (
SELECT
dt,
_date,
value,
FIRST_VALUE(value) OVER(
PARTITION BY _date
ORDER BY dt
) AS first_val
FROM `project.dataset.example_date_field_partioned_table`
)
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')
i.e. the query scans around 4 MB rather than 1 GB.
However, now I would need to add and populate that additional _date field (inconvenient with an external data source).
Having two fields with redundant information can also be confusing.
Additionally, there is now no partition pruning at all on dt (queries need to make sure to use _date instead).
Answer
BigQuery functions can sometimes lead the query optimizer to make inefficient choices; however, we're constantly trying to improve the query optimizer.
So the best possible workaround in your scenario would be adding an extra DATE column and using it to partition the table, i.e.:
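The answer breaks off at this point; presumably the intended example mirrors the DATE-field approach already shown in the question, along these lines (a sketch; the backfill SELECT from the existing table is an assumption):

```sql
-- Partition on an explicit DATE column instead of DATE(dt):
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, value
FROM `project.dataset.example_partioned_table`;
```

Queries (and the window's PARTITION BY) then reference _date directly, which is what makes the pruning work in the question's adjusted example.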