greenplum分区优化
在 greenplum 上,我有一个名为 fact_table
的大表,它由 RANGE(day_bucket)
分区。为什么以下查询如此慢:
select max(day_bucket) from fact_table where day_bucket >= '2011-09-11 00:00:00' and day_bucket < '2011-12-14'.
我想它应该只查看每个分区的头部并立即返回结果,因为每个分区都有相同的 day_bucket
列。但 greenplum 进行了 FULL 扫描来计算结果。任何人都可以向我解释原因吗?
更新:
感谢您回答我的问题,但这对您的提示没有帮助。 Greenplum 总是进行全面扫描,即使我使用 PARTITION BY LIST(day_bucket) 创建表:
CREATE TABLE fact_table ( id character varying(25) NOT NULL, day_bucket timestamp without time zone NOT NULL, ) WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=6) DISTRIBUTED BY (user_id) PARTITION BY LIST(day_bucket) ( PARTITION p20120101 VALUES ('2012-01-01 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120101', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), PARTITION p20120102 VALUES ('2012-01-02 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120102', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), PARTITION p20120103 VALUES ('2012-01-03 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120103', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), PARTITION p20120104 VALUES ('2012-01-04 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120104', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), .....
Explain 命令显示它总是进行全面扫描:
-> Append-only Columnar Scan on mytestlist_1_prt_p20120102 mytestlist (cost=0.00..34.95 rows=1 width=8) Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket Append-only Columnar Scan on mytestlist_1_prt_p20120103 mytestlist (cost=0.00..39.61 rows=1 width=8) Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucketOn greenplum, I have a big table called fact_table
which is partitioned by RANGE(day_bucket)
. Why is so slow for the following query:
select max(day_bucket) from fact_table where day_bucket >= '2011-09-11 00:00:00' and day_bucket < '2011-12-14'.
I suppose it should just look at the head of each partition and return the result immediately, since each partition of the same day_bucket
column. But greenplum did a FULL scan to calculate the result. Anybody can explain to me the reason?
Update:
Thanks for answering my question, but it doesn't help with your tip. Greenplum always do a full scan, even though I create the table with PARTITION BY LIST(day_bucket):
CREATE TABLE fact_table ( id character varying(25) NOT NULL, day_bucket timestamp without time zone NOT NULL, ) WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=6) DISTRIBUTED BY (user_id) PARTITION BY LIST(day_bucket) ( PARTITION p20120101 VALUES ('2012-01-01 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120101', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), PARTITION p20120102 VALUES ('2012-01-02 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120102', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), PARTITION p20120103 VALUES ('2012-01-03 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120103', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), PARTITION p20120104 VALUES ('2012-01-04 00:00:00'::timestamp without time zone) WITH (tablename='fact_table_1_prt_p20120104', appendonly=true, orientation=column, compresstype=zlib, compresslevel=6 ), .....
Explain command shows that it always do a full scan:
-> Append-only Columnar Scan on mytestlist_1_prt_p20120102 mytestlist (cost=0.00..34.95 rows=1 width=8)
Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket Append-only Columnar Scan on mytestlist_1_prt_p20120103 mytestlist (cost=0.00..39.61 rows=1 width=8)
Filter: day_bucket >= '2012-01-02 00:00:00'::timestamp without time zone AND day_bucket
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您应该注意应用于分区的约束。
为了让优化器正确地从扫描中排除某些分区,你应该帮助他。在你的情况下,你应该使用显式类型转换:(GP无法在规划阶段自动理解像“yyyy-mm-dd”这样的刺痛实际上是时间戳)
You should pay attention to constraints applied to your partitions.
To allow optimizer correctly exclude some partitions from scanning, you should help him. In your case, you should use explicit typecasting: (GP could not automatically understand on the planning stage that stings like 'yyyy-mm-dd' is actually timestamp)