greenplum分区优化

发布于 2024-12-21 13:02:27 字数 2113 浏览 2 评论 0原文

在 greenplum 上,我有一个名为 fact_table 的大表,它由 RANGE(day_bucket) 分区。为什么以下查询如此慢:

select max(day_bucket) from fact_table where day_bucket >= '2011-09-11 00:00:00' and day_bucket < '2011-12-14'.

我想它应该只查看每个分区的头部并立即返回结果,因为每个分区都有相同的 day_bucket 列。但 greenplum 进行了 FULL 扫描来计算结果。任何人都可以向我解释原因吗?


更新:

感谢您回答我的问题,但这对您的提示没有帮助。 Greenplum 总是进行全面扫描,即使我使用 PARTITION BY LIST(day_bucket) 创建表:

CREATE TABLE fact_table (
    id character varying(25) NOT NULL,
    day_bucket timestamp without time zone NOT NULL,
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=6) DISTRIBUTED BY (user_id) PARTITION BY LIST(day_bucket) 
          (
          PARTITION p20120101 VALUES ('2012-01-01 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120101', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120102 VALUES ('2012-01-02 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120102', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120103 VALUES ('2012-01-03 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120103', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120104 VALUES ('2012-01-04 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120104', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
       .....

Explain 命令显示它总是进行全面扫描:

-> Append-only Columnar Scan on mytestlist_1_prt_p20120102 mytestlist (cost=0.00..34.95 rows=1 width=8) Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket Append-only Columnar Scan on mytestlist_1_prt_p20120103 mytestlist (cost=0.00..39.61 rows=1 width=8) Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket

On greenplum, I have a big table called fact_table which is partitioned by RANGE(day_bucket). Why is so slow for the following query:

select max(day_bucket) from fact_table where day_bucket >= '2011-09-11 00:00:00' and day_bucket < '2011-12-14'.

I suppose it should just look at the head of each partition and return the result immediately, since each partition of the same day_bucket column. But greenplum did a FULL scan to calculate the result. Anybody can explain to me the reason?


Update:

Thanks for answering my question, but it doesn't help with your tip. Greenplum always do a full scan, even though I create the table with PARTITION BY LIST(day_bucket):

CREATE TABLE fact_table (
    id character varying(25) NOT NULL,
    day_bucket timestamp without time zone NOT NULL,
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=6) DISTRIBUTED BY (user_id) PARTITION BY LIST(day_bucket) 
          (
          PARTITION p20120101 VALUES ('2012-01-01 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120101', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120102 VALUES ('2012-01-02 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120102', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120103 VALUES ('2012-01-03 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120103', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
          PARTITION p20120104 VALUES ('2012-01-04 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120104', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), 
       .....

Explain command shows that it always do a full scan:

-> Append-only Columnar Scan on mytestlist_1_prt_p20120102 mytestlist (cost=0.00..34.95 rows=1 width=8)
Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket Append-only Columnar Scan on mytestlist_1_prt_p20120103 mytestlist (cost=0.00..39.61 rows=1 width=8)
Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

不离久伴 2024-12-28 13:02:27

您应该注意应用于分区的约束。
为了让优化器正确地从扫描中排除某些分区,你应该帮助他。在你的情况下,你应该使用显式类型转换:(GP无法在规划阶段自动理解像“yyyy-mm-dd”这样的刺痛实际上是时间戳)

select max(day_bucket) 
from fact_table 
where day_bucket >= '2011-09-11 00:00:00'::timestamp 
  and day_bucket <  '2011-12-14'::timestamp

You should pay attention to constraints applied to your partitions.
To allow optimizer correctly exclude some partitions from scanning, you should help him. In your case, you should use explicit typecasting: (GP could not automatically understand on the planning stage that stings like 'yyyy-mm-dd' is actually timestamp)

select max(day_bucket) 
from fact_table 
where day_bucket >= '2011-09-11 00:00:00'::timestamp 
  and day_bucket <  '2011-12-14'::timestamp
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文