在 MySQL 中使用空间索引时性能不佳

发布于 2024-10-13 22:29:12 字数 2753 浏览 6 评论 0原文

我正在尝试进行一个小实验来推送一个不是地理空间的数据集，但非常适合它，并且发现结果有些令人不安。数据集是基因组数据，例如人类基因组，其中有一个 DNA 区域，其中基因等元素占据特定的起始和终止坐标（我们的 X 轴）。我们有多个占据 Y 轴的 DNA 区域（染色体）。目标是返回沿单个 Y 坐标与两个 X 坐标相交的所有项目，例如 LineString(START 1, END 2)。

这个理论看起来很合理，所以我将它推入一个现有的基于 MySQL 的基因组项目中，并提出了一个表结构，如下所示：

CREATE TABLE `spatial_feature` (
  `spatial_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `external_id` int(10) unsigned NOT NULL,
  `external_type` int(3) unsigned NOT NULL,
  `location` geometry NOT NULL,
  PRIMARY KEY (`spatial_feature_id`),
  SPATIAL KEY `sf_location_idx` (`location`)
) ENGINE=MyISAM;

external_id 代表我们编码到此中的实体的标识符表和external_type 对其来源进行编码。一切看起来都不错，我输入了一些初步数据（30,000 行），看起来效果很好。当这一数字超过 300 万行标记时，MySQL 拒绝使用空间索引，并且在被迫使用空间索引时速度会变慢（使用全表扫描需要 40 秒，而使用全表扫描则需要 5 秒）。当添加更多数据时，开始使用索引，但性能损失仍然存在。强制关闭索引将查询时间缩短至 8 秒。我使用的查询如下所示：

select count(*)
from spatial_feature
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

输入的数据沿 Y 维度非常密集（就像您记录了一条很长的道路上每座建筑物、电话亭、邮政信箱和鸽子的位置一样）。我已经在 Java 中测试了 R 索引如何处理这些数据，并且该领域的其他人已经成功地将它们应用于平面文件格式。然而，没有人将它们应用到数据库中，据我所知，这是本次测试的目标。

当将大量数据添加到沿着特定轴不是非常不同的空间模型时，是否有人看到类似的行为？如果我反转坐标使用，问题仍然存在。如果这是一个原因，我正在运行以下设置

MacOS 10.6.6
MySQL 5.1.46

帮助！

中引入解释计划

+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| id | select_type | table           | type | possible_keys   | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | spatial_feature | ALL  | sf_location_idx | NULL | NULL    | NULL | 3636060 |    33.33 | Using where |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

还在重写的 SQL

select count(0) AS `count(*)` from `arabidopsis_thaliana_core_7_60_9`.`spatial_feature` where intersects(geometryfromtext('LineString(7420023 1, 7420023 1)'),`arabidopsis_thaliana_core_7_60_9`.`spatial_feature`.`location`)

，如下所示仍然没有突出显示为什么该查询的性能如此差

在阅读了 @Fraser 从 rickonrails 看来问题全都与索引不在内存中有关。如果我将类似的技术应用于本文中提到的技术（使键缓冲区确实非常大），然后我强制查询使用索引查询时间。我们仍然看到查询区域和查询区域之间存在滞后。然后搜索该区域的子集，但这一切都指向正确加载索引。

这个故事的寓意是什么？ MySQL 中的 R 索引性能相当差，直到进入内存后才具有出色的性能。对于我想要用它们做的事情来说，这并不是一个很好的解决方案，但它仍然提供了一个关于 MySQL 的有趣角度。

感谢所有帮助的人。

原文

I'm trying a little experiment at pushing a data set which is not geo-spatial but fits it quite well and am finding the results somewhat unsettling. The data set is genomic data e.g. the Human Genome where we have a region of DNA where elements like genes occupy specific start and stop coordinates (our X axis). We have multiple regions of DNA (chromosomes) which occupy the Y axis. The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate e.g. LineString(START 1, END 2).

The theory seemed sound so I pushed it into an existing MySQL based genome project and came up with a table structure like:

CREATE TABLE `spatial_feature` (
  `spatial_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `external_id` int(10) unsigned NOT NULL,
  `external_type` int(3) unsigned NOT NULL,
  `location` geometry NOT NULL,
  PRIMARY KEY (`spatial_feature_id`),
  SPATIAL KEY `sf_location_idx` (`location`)
) ENGINE=MyISAM;

external_id represents the identifier of the entity we have encoded into this table & external_type encodes the source of this. Everything looked good and I pushed in some preliminary data (30,000 rows) which seemed to work well. When this increased past the 3 million row mark MySQL refused to use the spatial index and was slower when it was forced to use it (40 seconds vs. 5 seconds using a full table scan). When more data was added the index started to be used but the performance penalty persisted. Forcing the index off brought the query down to 8 seconds. The query I'm using looks like:

select count(*)
from spatial_feature
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

The data going into this is be very dense along the Y dimensions (think of it like you've recorded the position of every building, telephone box, post box and pigeon on a very long road). I've done tests of how R-Indexes behave with this data in Java as well as others in the field have applied them to flat-file formats with success. However no one has applied them to databases AFAIK which is the goal of this test.

Has anyone out there seen a similar behaviour when adding large quantities of data to a spatial model which is not very disparate along a particular axis? The problem persists if I reverse the coordinate usage. I'm running the following setup if that's a cause

MacOS 10.6.6
MySQL 5.1.46

Help!

Also bringing in explain plan in

+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| id | select_type | table           | type | possible_keys   | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | spatial_feature | ALL  | sf_location_idx | NULL | NULL    | NULL | 3636060 |    33.33 | Using where |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

The re-written SQL looks like the following

select count(0) AS `count(*)` from `arabidopsis_thaliana_core_7_60_9`.`spatial_feature` where intersects(geometryfromtext('LineString(7420023 1, 7420023 1)'),`arabidopsis_thaliana_core_7_60_9`.`spatial_feature`.`location`)

Still not highlighting why this query's performance is so poor

After reading the article posted by @Fraser from rickonrails it seems like the problem is all to do with the index not being in memory. If I apply similar techniques to those mentioned in the article (making key buffer very big indeed) and I then force the query to use the index query times plumet. We still see a lag between querying a region & then searching for a subset of the region but it's all pointing to getting the load of the indexes correct.

What's the moral of the story? R-Indexes in MySQL have quite poor performance until they are in memory and then they have excellent performance. Not really a good solution for what I wanted to do wit them but still it provides an interesting angle on MySQL.

Thanks for all the help people.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

╰ゝ天使的微笑 2024-10-20 22:29:13

您确定关系数据库是正确的选择吗？如果我是您，我会考虑将您的数据集推送到 Solr 或 Elastic Search（可能将主数据集存储在其他地方）。这些引擎是为索引而构建的，您会注意到响应时间的差异。

回复收藏 0 原文

月光色 2024-10-20 22:29:12

目标是带回所有
与两个 X 相交的项目
沿单个 Y 的坐标
坐标

您是否考虑过使用具有多个字段的索引？就像：

CREATE INDEX spacial_search ON spatial_feature(y, x)

如果您正在使用一组有限的y值，这就是方法。

The goal is to bring back all the
items which intersect two X
coordinates along a single Y
coordinate

Have you considered using an index wit multiple fields? Like:

CREATE INDEX spacial_search ON spatial_feature(y, x)

If you are working with a limited set of y values this is the way to.

回复收藏 0 原文

心清如水 2024-10-20 22:29:12

我拥有遗传学学位，并且是一名程序员，您不需要使用 X 和 Y 作为您的命名法，它会变得太faffy...您需要一个开始和停止位置 em>（不是“轴”）和染色体编号。首先按染色体编号索引，然后按位置索引，然后按位置索引，然后按染色体索引。（问题：您正在处理可以有两个阅读框的真核生物或染色体吗？）

EG：（其中“x”=位置，“y”=染色体）

CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);

顺便说一句，染色体是非常长的字符串（就像数据一样），您可以（以加快速度）事物将它们像斑点一样倾倒（即编码基因和垃圾DNA）

I have a degree in Genetics and I am a programmer, you don't need be using an X and a Y as your nomenclature it'll get far too faffy... you need a start and a stop position (not an "axis") and a chromosome number. You index by the chromosome number first then the position and then you index the position then the chromosome. (Question: Are you dealing in eukaryotes or chromosomes that can have two reading frames?)

EG: (where "x" = position and "y" = chromosome)

CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);

Incidentally Chromosomes are very long strings (just like data) you can (to speed things up dump them as such as blobs (i.e. coding genes and junk DNA)

回复收藏 0 原文

甜扑 2024-10-20 22:29:12

从 EXPLAIN 计划中我们看到，虽然空间可能用于查询（'possible_keys' 列），但它没有被使用（'key' 列中为 NULL）。
我不确定为什么它没有自动选择，但您可以通过在查询中使用“强制索引”子句指定它来显式指示 MySql 使用索引：

select count(*)
from spatial_feature 
force index (sf_location_idx) --  <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

From the EXPLAIN planwe see that although the spatial might be used for the query ('possible_keys' column) , it is not used (NULL in 'key' column).
I am not sure why it is not selected automatically, but you may explicitly instruct MySql to use the index by specifying it in the query using a 'force index' clause:

select count(*)
from spatial_feature 
force index (sf_location_idx) --  <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

回复收藏 0 原文

~没有更多了~