Partition pruning when using an analytic function as part of a view, on a DATETIME field with DAY granularity

Posted 2025-01-25 09:58:41

I am trying to use analytic functions (e.g. FIRST_VALUE) while still benefiting from partition pruning, on a table that is partitioned on a DATETIME field with DAY granularity.

Example Data

Let's assume a table with the following columns:

name     type
dt       DATETIME
value    STRING

The table is partitioned on dt with DAY granularity.

An example table can be created using the following SQL:

CREATE TABLE `project.dataset.example_partioned_table`
PARTITION BY DATE(dt)
AS
SELECT dt, CONCAT('some value: ', STRING(dt)) AS value
FROM (
    SELECT
        DATETIME_ADD(
            DATETIME(_date),
            INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
        ) AS dt
    FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)

The generated table will be over 1 GB (around 3.4 MB per day).
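As an optional check (a sketch, assuming the dataset-level INFORMATION_SCHEMA.PARTITIONS view is available for this dataset), the per-partition sizes can be listed to confirm the roughly 3.4 MB per day:

SELECT
    partition_id,
    total_rows,
    total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'example_partioned_table'
ORDER BY partition_id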

The Problem

Now I want to get the first value in each partition. (Later I actually want to have a further breakdown)

As I want to use a view, the view itself wouldn't know the final date range. In the example query I am using a temporary table t_view in place of the view.

WITH t_view AS (
    SELECT
        dt,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY DATE(dt)
            ORDER BY dt
        ) AS first_val
    FROM `project.dataset.example_partioned_table`
)

SELECT *,
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')

The query result will contain something like some value: 2020-01-01 00:00:00 for first_val (i.e. the first value for that date).

However, as it stands, it is scanning the whole table (over 1 GB), when it should just scan the partition.

Other observations

If I don't include first_val (the analytical function) in the result, then the partition pruning works as intended.
Including first_val causes it to scan everything.
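For comparison, a minimal sketch of the same query without the analytic function, which does get pruned down to the selected partition:

WITH t_view AS (
    SELECT
        dt,
        value
    FROM `project.dataset.example_partioned_table`
)

SELECT *,
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')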

If I don't wrap dt in DATE, then partition pruning also works, but of course it doesn't provide the correct value.

I also tried DATETIME_TRUNC(dt, DAY), with the same lack of partition pruning as DATE(dt).

Adding the date WHERE clause inside the temporary table also works (see the sketch below), but I wouldn't know the date range inside the view.
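For illustration, a sketch of that variant with the filter pushed inside the temporary table; it prunes, but a view cannot hard-code the date range:

WITH t_view AS (
    SELECT
        dt,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY DATE(dt)
            ORDER BY dt
        ) AS first_val
    FROM `project.dataset.example_partioned_table`
    WHERE DATE(dt) = DATE('2020-01-01')
)

SELECT *,
FROM t_view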

How can I restrict the analytical function to the partition of the row?

Failed workaround using GROUP BY

Relatedly, I also tried a workaround using GROUP BY on the date, with the same result:

WITH t_view_1 AS (
    SELECT
        dt,
        DATE(dt) AS _date,
        value,
    FROM `project.dataset.example_partioned_table`
),

t_view_2 AS (
    SELECT
        _date,
        MIN(value) AS first_val
    FROM t_view_1
    GROUP BY _date
),

t_view AS (
    SELECT
        t_view_1._date,
        t_view_1.dt,
        t_view_2.first_val
    FROM t_view_1
    JOIN t_view_2
        ON t_view_2._date = t_view_1._date
)

SELECT *,
FROM t_view
WHERE _date = '2020-01-01'

As before, it is scanning the whole table rather than only processing the partition with the selected date.

Potentially working workaround with partition on DATE field

If the table is instead partitioned on a DATE field (_date), e.g.:

CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
    SELECT
        DATETIME_ADD(
            DATETIME(_date),
            INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
        ) AS dt
    FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)

Then the partition pruning works with the following adjusted example query:

WITH t_view AS (
    SELECT
        dt,
        _date,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY _date
            ORDER BY dt
        ) AS first_val
    FROM `project.dataset.example_date_field_partioned_table`
)

SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')

I.e. the query scans around 4 MB rather than over 1 GB.

However, now I would need to add and populate that additional _date field, which is inconvenient with an external data source.

Having two fields with redundant information can also be confusing.

Additionally, there is now no partition pruning at all on dt; queries need to make sure to filter on _date instead (as sketched below).
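For example, a sketch against the _date-partitioned table:

-- Scans the whole table: dt is no longer the partitioning column.
SELECT *
FROM `project.dataset.example_date_field_partioned_table`
WHERE DATE(dt) = DATE('2020-01-01')

-- Scans only the selected partition.
SELECT *
FROM `project.dataset.example_date_field_partioned_table`
WHERE _date = DATE('2020-01-01')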

Comments (1)

樱桃奶球 2025-02-01 09:58:41

BQ functions can sometimes lead the query optimizer to make inefficient choices; however, we're constantly trying to improve the query optimizer.

So, the best possible workaround in your scenario would be adding an extra DATE column and using it to partition the table.

I.e.:

CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
    SELECT
        DATETIME_ADD(
            DATETIME(_date),
            INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
        ) AS dt
    FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
    CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)

WITH t_view AS (
    SELECT
        dt,
        _date,
        value,
        FIRST_VALUE(value) OVER(
            PARTITION BY _date
            ORDER BY dt
        ) AS first_val
    FROM `mock.example_date_field_partioned_table`
)
 
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')
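As a follow-up sketch (the view name is illustrative, and this assumes the _date-partitioned table from above): the same pattern expressed as an actual view, with callers filtering on _date so that the filter lands on the partitioning column, matching the pruned CTE example:

CREATE OR REPLACE VIEW `project.dataset.example_view`
AS
SELECT
    dt,
    _date,
    value,
    FIRST_VALUE(value) OVER(
        PARTITION BY _date
        ORDER BY dt
    ) AS first_val
FROM `project.dataset.example_date_field_partioned_table`;

SELECT *
FROM `project.dataset.example_view`
WHERE _date = DATE('2020-01-01')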