确定盒子之间的距离
这个问题让我头疼了一段时间。我有一个 PostgreSQL 8.4 数据库,该数据库仅包含一个包含 4.000.000 多条记录的表。该表的结构如下:
CREATE TABLE metadata (
id serial NOT NULL,
"value" text NOT NULL DEFAULT ''::text,
segment_box box NOT NULL DEFAULT box(point(((-9223372036854775808)::bigint)::double precision, (1)::double precision), point((9223372036854775807::bigint)::double precision, ((-1))::double precision)),
CONSTRAINT metadata_pk PRIMARY KEY (id)
)
CREATE INDEX metadata_segment_box_ix
ON metadata
USING gist
(segment_box);
CREATE INDEX metadata_tag_value_ix
ON metadata
USING btree
(value);
该表包含段(按时间),表示为矩形框。这些段使用“值”列进行注释。
我想在数据库上执行的查询类型尝试查找特定窗口中包含的具有指定值的所有段。成功实现此目的的查询是:
SELECT * FROM (SELECT * FROM metadata WHERE value='X') a,
(SELECT * FROM metadata WHERE AND value='Y') b
WHERE a.segment_box <-> b.segment_box <= 3000
但是,正如您可能注意到的那样,数据库无法有效地执行此查询。子查询 a 和 b 的笛卡尔积变得非常大。有没有办法更有效地执行这些查询?我可以想象某种滑动窗口方法可以解决这个问题。也许类似于以下内容:
SELECT *, rank() OVER (
PARTITION BY "value" ORDER BY (segment_box[1])[0], (segment_box[0])[0]
) FROM metadata WHERE value='X' OR value='Y'
更新: 发布这个问题后我尝试的一件事是在 Postgres 中创建一个自定义函数。我尝试过:
CREATE OR REPLACE FUNCTION within_window(size bigint DEFAULT 0)
RETURNS setof metadata AS
$BODY$DECLARE
segment RECORD;
neighbour RECORD;
newwindow box;
BEGIN
FOR segment IN (
SELECT * FROM metadata WHERE value='X' OR value='Y'
ORDER BY (segment_box[1])[0], (segment_box[0])[0]
) LOOP
newwindow := box(segment.segment_box[0],
point((((segment.segment_box[1])[0]) + size), (segment.segment_box[1])[1]));
FOR neighbour IN (
SELECT DISTINCT ON (metadata_id) * FROM metadata WHERE value='X' OR value='Y')
AND segment_box &< newwindow
AND segment_box &> newwindow
) LOOP
RETURN NEXT neighbour;
END LOOP;
END LOOP;
END;$BODY$
LANGUAGE plpgsql;
但是,由于必须执行多次子查询,因此该函数与我上面描述的基本解决方案一样慢。对此还有其他想法吗?
This problem is causing me a headache for a while. I've a PostgreSQL 8.4 database that consists of only one table containing more than 4.000.000 records. This table is structured as follows:
CREATE TABLE metadata (
id serial NOT NULL,
"value" text NOT NULL DEFAULT ''::text,
segment_box box NOT NULL DEFAULT box(point(((-9223372036854775808)::bigint)::double precision, (1)::double precision), point((9223372036854775807::bigint)::double precision, ((-1))::double precision)),
CONSTRAINT metadata_pk PRIMARY KEY (id)
)
CREATE INDEX metadata_segment_box_ix
ON metadata
USING gist
(segment_box);
CREATE INDEX metadata_tag_value_ix
ON metadata
USING btree
(value);
The table contains segments (in time), represented as rectangular boxes. These segments are annotated using the "value" column.
The kind of queries I would like to perform on the database try to find all segments with a specified value that is contained within a certain window. A query that successfully achieves this is:
SELECT * FROM (SELECT * FROM metadata WHERE value='X') a,
(SELECT * FROM metadata WHERE AND value='Y') b
WHERE a.segment_box <-> b.segment_box <= 3000
But, as you probably noticed, this query cannot be performed efficiently by the database. The cartesian product of sub-queries a and b is becoming really large. Is there a way to perform these queries more efficiently? I can imagine some sort of sliding window approach would do the trick. Maybe something like the following:
SELECT *, rank() OVER (
PARTITION BY "value" ORDER BY (segment_box[1])[0], (segment_box[0])[0]
) FROM metadata WHERE value='X' OR value='Y'
Update:
One of the things I tried after posting this question is creating a custom function in Postgres. I tried:
CREATE OR REPLACE FUNCTION within_window(size bigint DEFAULT 0)
RETURNS setof metadata AS
$BODY$DECLARE
segment RECORD;
neighbour RECORD;
newwindow box;
BEGIN
FOR segment IN (
SELECT * FROM metadata WHERE value='X' OR value='Y'
ORDER BY (segment_box[1])[0], (segment_box[0])[0]
) LOOP
newwindow := box(segment.segment_box[0],
point((((segment.segment_box[1])[0]) + size), (segment.segment_box[1])[1]));
FOR neighbour IN (
SELECT DISTINCT ON (metadata_id) * FROM metadata WHERE value='X' OR value='Y')
AND segment_box &< newwindow
AND segment_box &> newwindow
) LOOP
RETURN NEXT neighbour;
END LOOP;
END LOOP;
END;$BODY$
LANGUAGE plpgsql;
However, this function is as slow as the basic solution I described above because of the subquery that must be performed many times. Any other thoughts on this??
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我自己用一种扫线算法解决了这个问题。仅执行一次查询。我使用光标来回扫过查询的结果集。结果算法的工作原理如下:
使用此函数,查询只需要 476 毫秒,而不是 6 分钟以上(在 4+ 百万行数据库上)!
I solved the problem myself with a kind of sweep line algorithm. Only one query is performed. I use a cursor to sweep back and forth over the query's resultset. The resulting algorithm works as follows:
Using this function the query only needs 476 ms instead of 6+ minutes (on the 4+ million row database)!