Very slow Bitmap Heap Scan in Postgres
I have the following simple table that contains traffic measurement data:
CREATE TABLE "TrafficData"
(
"RoadID" character varying NOT NULL,
"DateID" numeric NOT NULL,
"ExactDateTime" timestamp NOT NULL,
"CarsSpeed" numeric NOT NULL,
"CarsCount" numeric NOT NULL
)
CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");
The column RoadID uniquely identifies the road whose data is being recorded, while DateID identifies the day of the year (1..365) of the data - basically a rounded-off representation of ExactDateTime.
I have about 100,000,000 rows; there are 1,000 distinct values in the column "RoadID" and 365 distinct values in the column "DateID".
I then run the following query:
SELECT * FROM "TrafficData"
WHERE "RoadID"='Station_1'
AND "DateID">20100610 AND "DateID"<20100618;
This takes up to three mind-boggling seconds to finish, and I cannot for the life of me figure out WHY.
EXPLAIN ANALYZE gives me the following output:
Bitmap Heap Scan on "TrafficData" (cost=104.84..9743.06 rows=2496 width=47) (actual time=35.112..2162.404 rows=2016 loops=1)
Recheck Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
-> Bitmap Index Scan on "RoadDate_Idx" (cost=0.00..104.22 rows=2496 width=0) (actual time=1.637..1.637 rows=2016 loops=1)
Index Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
Total runtime: 2163.985 ms
My specs:
- Windows 7
- Postgres 9.0
- 4GB RAM
I'd greatly appreciate any helpful pointers!
3 Answers
The slow part is obviously fetching the data from the table, since the index access seems to be very fast. You might either optimize your RAM usage parameters (see http://wiki.postgresql.org/wiki/Performance_Optimization and http://www.varlena.com/GeneralBits/Tidbits/perf.html), or optimize the layout of the data in the table by issuing a CLUSTER command (see http://www.postgresql.org/docs/8.3/static/sql-cluster.html). That should do it.
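A minimal sketch of the CLUSTER approach, using the table and index names from the question (note that CLUSTER takes an exclusive lock and rewrites the whole table, so on 100M rows it is best run during a maintenance window):

```sql
-- Rewrite the heap so rows are physically ordered by ("RoadID", "DateID");
-- the ~2000 rows the query needs then sit in a handful of adjacent blocks.
CLUSTER "TrafficData" USING "RoadDate_Idx";

-- Refresh planner statistics after the rewrite.
ANALYZE "TrafficData";
```

The clustered order is not maintained as new rows arrive, so it would need to be re-run after large bulk loads.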
Adding to Daniel's answer, the cluster operation is a one-off process that rearranges the data on disk. The intent is to get your 2000 result rows from fewer disk blocks.
As this is dummy data, being used to find out how you can quickly query it, I'd recommend reloading it in a pattern closer to how it will be loaded as it is generated. I imagine that the data is generated one day at a time, which will effectively result in a strong correlation between DateID and the location on disk. If that is the case, then I'd either cluster by DateID, or split your test data into 365 separate loads and reload it. Without that, and having randomly generated data, you're most likely having to perform over 2000 seeks of your disk head.
I'd also check that anything else you're running on Windows 7 isn't adding time to those reads that you don't need, such as ensuring that the blocks read do not contain virus signatures, or concurrently performing an automatically scheduled disk defragmentation (resulting in the disk head hardly ever being anywhere close to where it was last time a database block was read).
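One way to sketch the "reload in generation order" idea without 365 separate loads is to rebuild the table sorted; the "TrafficData_sorted" name here is made up for illustration, and this assumes no other objects depend on the old table:

```sql
-- Rebuild the table with rows physically ordered the way the query reads them.
CREATE TABLE "TrafficData_sorted" AS
SELECT * FROM "TrafficData"
ORDER BY "RoadID", "DateID";

-- Swap the sorted copy in and recreate the index on it.
DROP TABLE "TrafficData";
ALTER TABLE "TrafficData_sorted" RENAME TO "TrafficData";
CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");
```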
Two small suggestions: rewrite the range condition as a BETWEEN,
AND "DateID">20100610 AND "DateID"<20100618;
->
DateID BETWEEN 20100611 AND 20100617;
and make RoadID a text field instead of varchar.
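Put together, the rewritten query would look like this; since DateID holds whole-number day keys, the BETWEEN form is equivalent to the original strict inequalities:

```sql
SELECT * FROM "TrafficData"
WHERE "RoadID" = 'Station_1'
  AND "DateID" BETWEEN 20100611 AND 20100617;
```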