COUNT and GROUP BY on text fields seems slow

Posted 2024-09-11 01:50:36

I'm building a MySQL database which contains entries about special substrings of DNA in species of yeast. My table looks like this:

+--------------+---------+------+-----+---------+-------+
| Field        | Type    | Null | Key | Default | Extra |
+--------------+---------+------+-----+---------+-------+
| species      | text    | YES  | MUL | NULL    |       |
| region       | text    | YES  | MUL | NULL    |       |
| gene         | text    | YES  | MUL | NULL    |       |
| startPos     | int(11) | YES  |     | NULL    |       |
| repeatLength | int(11) | YES  |     | NULL    |       |
| coreLength   | int(11) | YES  |     | NULL    |       |
| sequence     | text    | YES  | MUL | NULL    |       |
+--------------+---------+------+-----+---------+-------+

There are approximately 1.8 million records. In one type of query I want to see how many DNA substrings are associated with each type of species and region, so I issue this query:

select species, region, count(*) from mytablename group by species, region;

The species and region columns have only two possible entries (conserved/scer for species, and promoter/coding for region) yet this query takes about 30 seconds.

Is this a normal amount of time to expect for this type of query given the size of the table? Is it slow because I'm using text fields instead of simple integer or boolean values? (I prefer text fields because several non-CS researchers will be using the DB.) Any other ideas and suggestions would be welcome.

Please excuse if this is a boneheaded question, I am an SQL neophyte.

P.S. I've also seen this question but the proposed solution doesn't seem relevant for what I'm doing.

EDIT: Converting those fields to VARCHARs reduced the runtime to ~2.5 seconds. Note I also timed it against ENUMs which had a similar timing.
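
For readers who want to reproduce the conversion mentioned in the edit, it might look like the sketch below. The table name `mytablename` and the length of 16 are assumptions, not details from the original post; any length that fits the longest stored value would do.

```sql
-- Sketch only: table name and column lengths are assumed.
-- If an existing index on these columns uses a prefix longer than the
-- new column length, that index may need to be dropped or rebuilt first.
ALTER TABLE mytablename
  MODIFY species VARCHAR(16),
  MODIFY region  VARCHAR(16);

-- Re-run the aggregate to compare timings before and after.
SELECT species, region, COUNT(*)
FROM mytablename
GROUP BY species, region;
```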

Comments (2)

空心↖ 2024-09-18 01:50:37

Why are all your string-based columns defined as TEXT? If you read the performance comparison, you'll see that TEXT was ~3x slower than a VARCHAR column using identical indexing: http://forums.mysql.com/read.php?24,105964,105964

淡淡的优雅 2024-09-18 01:50:37

If your fields are only ever going to have 2 values, you're much better off making them booleans. You should also make everything NOT NULL unless there's a real reason you'll need it to be NULL.

Also take a look at the ENUM type for a better way to use a finite number of human-readable values for a column.
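
As a rough sketch of the ENUM suggestion, using the four values listed in the question (the table name `mytablename` is an assumption):

```sql
-- Sketch only: assumes every existing row already holds one of the
-- listed values. Rows with NULL or any other value will cause an error
-- or a warning, depending on the SQL mode, so check the data first.
ALTER TABLE mytablename
  MODIFY species ENUM('conserved', 'scer')  NOT NULL,
  MODIFY region  ENUM('promoter', 'coding') NOT NULL;
```

The values stay human-readable in query results, while each one is stored internally as a small integer.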

As for slowness, the first thing to try is to create indices on your columns. For the particular query you're showing here, an index on species, region should make a huge difference:

create index species_region_idx on mytablename (species, region);

should do it.
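
One way to check whether the composite index is actually being used is to look at the query plan. A minimal sketch, again assuming the table is called `mytablename`:

```sql
-- In the output, the `key` column should name the composite index.
-- "Using index" in `Extra` would indicate the query is answered from
-- the index alone, without reading the table rows.
EXPLAIN
SELECT species, region, COUNT(*)
FROM mytablename
GROUP BY species, region;

-- Note: while the columns are still TEXT, MySQL requires a prefix
-- length on each indexed column, for example:
--   (species(16), region(16))
-- The length 16 is an assumed value, not one from the original post.
```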
