在 MySQL 中使用空间索引时性能不佳

发布于 2024-10-13 22:29:12 字数 2753 浏览 6 评论 0原文

我正在尝试进行一个小实验来推送一个不是地理空间的数据集,但非常适合它,并且发现结果有些令人不安。数据集是基因组数据,例如人类基因组,其中有一个 DNA 区域,其中基因等元素占据特定的起始和终止坐标(我们的 X 轴)。我们有多个占据 Y 轴的 DNA 区域(染色体)。目标是返回沿单个 Y 坐标与两个 X 坐标相交的所有项目,例如 LineString(START 1, END 2)。

这个理论看起来很合理,所以我将它推入一个现有的基于 MySQL 的基因组项目中,并提出了一个表结构,如下所示:

CREATE TABLE `spatial_feature` (
  `spatial_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `external_id` int(10) unsigned NOT NULL,
  `external_type` int(3) unsigned NOT NULL,
  `location` geometry NOT NULL,
  PRIMARY KEY (`spatial_feature_id`),
  SPATIAL KEY `sf_location_idx` (`location`)
) ENGINE=MyISAM;

external_id 代表我们编码到此中的实体的标识符表和external_type 对其来源进行编码。一切看起来都不错,我输入了一些初步数据(30,000 行),看起来效果很好。当这一数字超过 300 万行标记时,MySQL 拒绝使用空间索引,并且在被迫使用空间索引时速度会变慢(使用全表扫描需要 40 秒,而使用全表扫描则需要 5 秒)。当添加更多数据时,开始使用索引,但性能损失仍然存在。强制关闭索引将查询时间缩短至 8 秒。我使用的查询如下所示:

select count(*)
from spatial_feature
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

输入的数据沿 Y 维度非常密集(就像您记录了一条很长的道路上每座建筑物、电话亭、邮政信箱和鸽子的位置一样) 。我已经在 J​​ava 中测试了 R 索引如何处理这些数据,并且该领域的其他人已经成功地将它们应用于平面文件格式。然而,没有人将它们应用到数据库中,据我所知,这是本次测试的目标。

当将大量数据添加到沿着特定轴不是非常不同的空间模型时,是否有人看到类似的行为?如果我反转坐标使用,问题仍然存在。如果这是一个原因,我正在运行以下设置

  • MacOS 10.6.6
  • MySQL 5.1.46

帮助!

中引入解释计划

+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| id | select_type | table           | type | possible_keys   | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | spatial_feature | ALL  | sf_location_idx | NULL | NULL    | NULL | 3636060 |    33.33 | Using where |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

还在重写的 SQL

select count(0) AS `count(*)` from `arabidopsis_thaliana_core_7_60_9`.`spatial_feature` where intersects(geometryfromtext('LineString(7420023 1, 7420023 1)'),`arabidopsis_thaliana_core_7_60_9`.`spatial_feature`.`location`)

,如下所示仍然没有突出显示为什么该查询的性能如此差

在阅读了 @Fraser 从 rickonrails 看来问题全都与索引不在内存中有关。如果我将类似的技术应用于本文中提到的技术(使键缓冲区确实非常大),然后我强制查询使用索引查询时间。我们仍然看到查询区域和查询区域之间存在滞后。然后搜索该区域的子集,但这一切都指向正确加载索引。

这个故事的寓意是什么? MySQL 中的 R 索引性能相当差,直到进入内存后才具有出色的性能。对于我想要用它们做的事情来说,这并不是一个很好的解决方案,但它仍然提供了一个关于 MySQL 的有趣角度。

感谢所有帮助的人。

I'm trying a little experiment at pushing a data set which is not geo-spatial but fits it quite well and am finding the results somewhat unsettling. The data set is genomic data e.g. the Human Genome where we have a region of DNA where elements like genes occupy specific start and stop coordinates (our X axis). We have multiple regions of DNA (chromosomes) which occupy the Y axis. The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate e.g. LineString(START 1, END 2).

The theory seemed sound so I pushed it into an existing MySQL based genome project and came up with a table structure like:

CREATE TABLE `spatial_feature` (
  `spatial_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `external_id` int(10) unsigned NOT NULL,
  `external_type` int(3) unsigned NOT NULL,
  `location` geometry NOT NULL,
  PRIMARY KEY (`spatial_feature_id`),
  SPATIAL KEY `sf_location_idx` (`location`)
) ENGINE=MyISAM;

external_id represents the identifier of the entity we have encoded into this table & external_type encodes the source of this. Everything looked good and I pushed in some preliminary data (30,000 rows) which seemed to work well. When this increased past the 3 million row mark MySQL refused to use the spatial index and was slower when it was forced to use it (40 seconds vs. 5 seconds using a full table scan). When more data was added the index started to be used but the performance penalty persisted. Forcing the index off brought the query down to 8 seconds. The query I'm using looks like:

select count(*)
from spatial_feature
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

The data going into this is be very dense along the Y dimensions (think of it like you've recorded the position of every building, telephone box, post box and pigeon on a very long road). I've done tests of how R-Indexes behave with this data in Java as well as others in the field have applied them to flat-file formats with success. However no one has applied them to databases AFAIK which is the goal of this test.

Has anyone out there seen a similar behaviour when adding large quantities of data to a spatial model which is not very disparate along a particular axis? The problem persists if I reverse the coordinate usage. I'm running the following setup if that's a cause

  • MacOS 10.6.6
  • MySQL 5.1.46

Help!

Also bringing in explain plan in

+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| id | select_type | table           | type | possible_keys   | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | spatial_feature | ALL  | sf_location_idx | NULL | NULL    | NULL | 3636060 |    33.33 | Using where |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

The re-written SQL looks like the following

select count(0) AS `count(*)` from `arabidopsis_thaliana_core_7_60_9`.`spatial_feature` where intersects(geometryfromtext('LineString(7420023 1, 7420023 1)'),`arabidopsis_thaliana_core_7_60_9`.`spatial_feature`.`location`)

Still not highlighting why this query's performance is so poor

After reading the article posted by @Fraser from rickonrails it seems like the problem is all to do with the index not being in memory. If I apply similar techniques to those mentioned in the article (making key buffer very big indeed) and I then force the query to use the index query times plumet. We still see a lag between querying a region & then searching for a subset of the region but it's all pointing to getting the load of the indexes correct.

What's the moral of the story? R-Indexes in MySQL have quite poor performance until they are in memory and then they have excellent performance. Not really a good solution for what I wanted to do wit them but still it provides an interesting angle on MySQL.

Thanks for all the help people.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

╰ゝ天使的微笑 2024-10-20 22:29:13

您确定关系数据库是正确的选择吗?如果我是您,我会考虑将您的数据集推送到 Solr 或 Elastic Search(可能将主数据集存储在其他地方)。这些引擎是为索引而构建的,您会注意到响应时间的差异。

Are you sure a relational database is the way to go? If I were you I'd look at pushing your datasets to Solr or Elastic Search (probably storing the master datasets elsewhere). These engines are built for indexing, you will notice the difference in response times.

月光色 2024-10-20 22:29:12

目标是带回所有
与两个 X 相交的项目
沿单个 Y 的坐标
坐标

您是否考虑过使用具有多个字段的索引?就像:

CREATE INDEX spacial_search ON spatial_feature(y, x)

如果您正在使用一组有限的y值,这就是方法。

The goal is to bring back all the
items which intersect two X
coordinates along a single Y
coordinate

Have you considered using an index wit multiple fields? Like:

CREATE INDEX spacial_search ON spatial_feature(y, x)

If you are working with a limited set of y values this is the way to.

心清如水 2024-10-20 22:29:12

我拥有遗传学学位,并且是一名程序员,您不需要使用 X 和 Y 作为您的命名法,它会变得太faffy...您需要一个开始和停止位置 em>(不是“轴”)和染色体编号。首先按染色体编号索引,然后按位置索引,然后按位置索引,然后按染色体索引。 (问题:您正在处理可以有两个阅读框的真核生物或染色体吗?)

EG:(其中“x”=位置,“y”=染色体)

CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);

顺便说一句,染色体是非常长的字符串(就像数据一样),您可以(以加快速度)事物将它们像斑点一样倾倒(即编码基因和垃圾DNA)

I have a degree in Genetics and I am a programmer, you don't need be using an X and a Y as your nomenclature it'll get far too faffy... you need a start and a stop position (not an "axis") and a chromosome number. You index by the chromosome number first then the position and then you index the position then the chromosome. (Question: Are you dealing in eukaryotes or chromosomes that can have two reading frames?)

EG: (where "x" = position and "y" = chromosome)

CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);

Incidentally Chromosomes are very long strings (just like data) you can (to speed things up dump them as such as blobs (i.e. coding genes and junk DNA)

甜扑 2024-10-20 22:29:12

从 EXPLAIN 计划中我们看到,虽然空间可能用于查询('possible_keys' 列),但它没有被使用('key' 列中为 NULL)。
我不确定为什么它没有自动选择,但您可以通过在查询中使用“强制索引”子句指定它来显式指示 MySql 使用索引:

select count(*)
from spatial_feature 
force index (sf_location_idx) --  <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

From the EXPLAIN planwe see that although the spatial might be used for the query ('possible_keys' column) , it is not used (NULL in 'key' column).
I am not sure why it is not selected automatically, but you may explicitly instruct MySql to use the index by specifying it in the query using a 'force index' clause:

select count(*)
from spatial_feature 
force index (sf_location_idx) --  <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文