Very slow Bitmap Heap Scan in Postgres
I have the following simple table that contains traffic measurement data:
CREATE TABLE "TrafficData"
(
"RoadID" character varying NOT NULL,
"DateID" numeric NOT NULL,
"ExactDateTime" timestamp NOT NULL,
"CarsSpeed" numeric NOT NULL,
"CarsCount" numeric NOT NULL
)
CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");
The column RoadID uniquely identifies the road whose data is being recorded, while DateID identifies the day of the year (1..365) of the data - basically a rounded-off representation of ExactDateTime.
I have about 100,000,000 rows; there are 1,000 distinct values in the column "RoadID" and 365 distinct values in the column "DateID".
I then run the following query:
SELECT * FROM "TrafficData"
WHERE "RoadID"='Station_1'
AND "DateID">20100610 AND "DateID"<20100618;
This takes up to three mind-boggling seconds to finish, and I cannot for the life of me figure out WHY.
EXPLAIN ANALYZE gives me the following output:
Bitmap Heap Scan on "TrafficData" (cost=104.84..9743.06 rows=2496 width=47) (actual time=35.112..2162.404 rows=2016 loops=1)
Recheck Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
-> Bitmap Index Scan on "RoadDate_Idx" (cost=0.00..104.22 rows=2496 width=0) (actual time=1.637..1.637 rows=2016 loops=1)
Index Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
Total runtime: 2163.985 ms
My specs:
- Windows 7
- Postgres 9.0
- 4GB RAM
I'd greatly appreciate any helpful pointers!
3 Answers
The slow part is obviously fetching the data from the table, since the index access seems to be very fast. You might either optimize your RAM usage parameters (see http://wiki.postgresql.org/wiki/Performance_Optimization and http://www.varlena.com/GeneralBits/Tidbits/perf.html), or optimize the layout of the data in the table by issuing a CLUSTER command (see http://www.postgresql.org/docs/8.3/static/sql-cluster.html). That should do it.
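A minimal sketch of the CLUSTER approach, using the table and index names from the question (note that CLUSTER takes an exclusive lock and rewrites the whole table, so on 100M rows it is best run during a maintenance window):

```sql
-- Rewrite the heap so rows are physically ordered by ("RoadID", "DateID");
-- the ~2000 rows the query needs then sit in a handful of adjacent blocks.
CLUSTER "TrafficData" USING "RoadDate_Idx";

-- Refresh planner statistics after the rewrite.
ANALYZE "TrafficData";
```

The clustered order is not maintained as new rows arrive, so it would need to be re-run after large bulk loads.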
Adding to Daniel's answer, the cluster operation is a one-off process that rearranges the data on disk. The intent is to get your 2000 result rows from fewer disk blocks.
As this is dummy data, being used to find out how you can quickly query it, I'd recommend reloading it in a pattern closer to how it will be loaded as it is generated. I imagine that the data is generated one day at a time, which will effectively result in a strong correlation between DateID and the location on disk. If that is the case, then I'd either cluster by DateID, or split your test data into 365 separate loads and reload it. Without that, and having randomly generated data, you're most likely having to perform over 2000 seeks of your disk head.
I'd also check that anything else you're running on Windows 7 isn't adding time to those reads that you don't need, such as ensuring that the blocks read do not contain virus signatures, or concurrently performing an automatically scheduled disk defragmentation (resulting in the disk head hardly ever being anywhere close to where it was last time a database block was read).
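One way to sketch the "reload in generation order" idea without 365 separate loads is to rebuild the table sorted; the "TrafficData_sorted" name here is made up for illustration, and this assumes no other objects depend on the old table:

```sql
-- Rebuild the table with rows physically ordered the way the query reads them.
CREATE TABLE "TrafficData_sorted" AS
SELECT * FROM "TrafficData"
ORDER BY "RoadID", "DateID";

-- Swap the sorted copy in and recreate the index on it.
DROP TABLE "TrafficData";
ALTER TABLE "TrafficData_sorted" RENAME TO "TrafficData";
CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");
```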
Two small suggestions: rewrite the range condition as a BETWEEN,
AND "DateID">20100610 AND "DateID"<20100618;
->
DateID BETWEEN 20100611 AND 20100617;
and make RoadID a text field instead of varchar.
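Put together, the rewritten query would look like this; since DateID holds whole-number day keys, the BETWEEN form is equivalent to the original strict inequalities:

```sql
SELECT * FROM "TrafficData"
WHERE "RoadID" = 'Station_1'
  AND "DateID" BETWEEN 20100611 AND 20100617;
```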