What is a reasonable number of rows and tables to join in MySQL?
I have one table that maps locations to postal codes. For example, New York State has about 2000 postal codes. I have another table that maps mail to the postal codes it was sent to, but this table has about 5 million rows. I want to find all the mail that was sent to New York State, which seems simple enough, but the query is unbelievably slow; I haven't even been able to wait long enough for it to finish. Is the problem that there are 5 million rows? I can't help but think that 5 million shouldn't be such a large number for a computer these days... Oh, and everything is indexed. Is SQL just not designed to handle such large joins?
UPDATE: as people have asked, I've updated this question with the table definitions and the query that I'm using.
-- Roughly 70,000 rows
CREATE TABLE `mail_zip` (
`mail_id` int(11) default NULL,
`zip` int(11) default NULL,
KEY `index_mail_zip_on_mail_id` (`mail_id`),
KEY `index_mail_zip_on_zip` (`zip`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
-- Roughly 5,000,000 rows
CREATE TABLE `geographies` (
`city_id` int(11) default NULL,
`postal_code` int(11) default NULL,
KEY `index_geographies_on_city_id` (`city_id`),
KEY `index_geographies_on_postal_code` (`postal_code`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
-- Query
select mz.mail_id from mail_zip mz join geographies g on mz.zip = g.postal_code where g.city_id = 36 limit 10;
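For illustration only (the index names below are placeholders, not objects from the schema above), composite indexes along these lines would cover both the city_id filter and the join on zip:
-- Illustrative composite indexes; the names are made up
ALTER TABLE geographies ADD INDEX idx_geographies_city_postal (city_id, postal_code);
ALTER TABLE mail_zip ADD INDEX idx_mail_zip_zip_mail (zip, mail_id);
With these, the optimizer can resolve city_id = 36 and the join on zip entirely from the indexes instead of scanning mail_zip.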
UPDATE 2: okay, I lied. With the proper indices, the above query works fine. The problem is actually the ORDER BY clause. See the two nearly identical queries below: the only difference is "order by m.sent_on desc", which adds an extra 4 minutes and 30 seconds to the query! Also, EXPLAIN shows that adding the ORDER BY brings in a filesort, which must be what's slowing it down. However, sent_on is indexed, so why isn't the index being used? I must not be building the index properly.
-- Roughly 350,000 rows
CREATE TABLE `mail` (
`id` int(11) NOT NULL auto_increment,
`sent_on` datetime default NULL,
`title` varchar(255) default NULL,
PRIMARY KEY (`id`),
KEY `index_mail_on_sent_on` (`sent_on`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
-- Runs in 0.19 seconds
-- Query
select distinct(m.id), m.title from mail m join mail_zip mz on mz.mail_id = m.id join geographies g on g.postal_code = mz.zip where g.city_id = 36 limit 10;
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+-----------------------+
| 1 | SIMPLE | mz | ALL | index_mail_zip_on_com_id,index_mail_zip_on_zip | NULL | NULL | NULL | 5260053 | Using temporary |
| 1 | SIMPLE | m | eq_ref | PRIMARY | PRIMARY | 4 | mz.com_id | 1 | |
| 1 | SIMPLE | g | ref | index_geographies_on_city_id,zip | zip | 5 | mz.zip | 1 | Using where; Distinct |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+-----------------------+
-- Runs in 4 minutes and 30 seconds
-- Query
select distinct(m.id), m.title from mail m join mail_zip mz on mz.mail_id = m.id join geographies g on g.postal_code = mz.zip where g.city_id = 36 order by m.sent_on desc limit 10;
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+---------------------------------+
| 1 | SIMPLE | mz | ALL | index_mail_zip_on_com_id,index_mail_zip_on_zip | NULL | NULL | NULL | 5260053 | Using temporary; Using filesort |
| 1 | SIMPLE | m | eq_ref | PRIMARY | PRIMARY | 4 | mz.com_id | 1 | |
| 1 | SIMPLE | g | ref | index_geographies_on_city_id,zip | zip | 5 | mz.zip | 1 | Using where; Distinct |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+----------------------+---------+---------------------------------+
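For what it's worth, the problem is unlikely to be the index itself: on mail alone, an ORDER BY sent_on DESC LIMIT 10 should be served straight from index_mail_on_sent_on with no filesort, which a check like this can confirm:
-- Sanity check: single-table ORDER BY ... LIMIT on the indexed column
EXPLAIN
SELECT m.id, m.title
FROM mail m
ORDER BY m.sent_on DESC
LIMIT 10;
If that plan uses the index cleanly, the index is built correctly; the slow query sorts late because the join is driven from mz, so the ORDER BY can only be applied to the joined result.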
Comments (2)
MySQL is perfectly capable of handling joins involving 5 million rows or even many more than this.
Your problem is probably one of two things:
1. Your tables aren't indexed in a way that supports this query, or
2. The query is written in a way that prevents the indexes from being used effectively.
Since you claim that "everything is indexed" I'd guess that it is the second. Post your table information and your query and we should be able to help you fix it.
You can also run EXPLAIN on your query to see which indexes it is using.
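A minimal example of that, using the first query from the question (the columns worth reading are type, key, rows and Extra):
-- Prepend EXPLAIN to the query exactly as it is normally run
EXPLAIN
SELECT mz.mail_id
FROM mail_zip mz
JOIN geographies g ON mz.zip = g.postal_code
WHERE g.city_id = 36
LIMIT 10;
type = ALL means a full table scan, key = NULL means no index was chosen for that table, and "Using temporary" / "Using filesort" under Extra usually mark the expensive steps.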
You should be able to join, for the sake of argument, 10 tables, with the biggest table(s) having rows in the multiple millions and upwards, and you should still be able to get results fast.
Assume that there is something up with the indexing strategy or the query operation or the query plan.
It has nothing to do with SQL per se; it might be related to MySQL, or the particular storage engine you are using in MySQL.
Did you know that the SQL standard doesn't define anything related to indexes? You could argue that anything related to indexes is non-standard, though 'extra to the standard' might be a better way to look at it.
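To make the query-plan point concrete with the tables from the question: the ORDER BY is applied after the joins and the DISTINCT, so MySQL materializes every matching row and filesorts the lot before taking ten. Here is a sketch of a rewrite that drives the scan from mail in sent_on order instead (whether the optimizer actually stops early depends on the MySQL version and the table statistics):
-- Sketch only: walk index_mail_on_sent_on newest-first and probe the join per row
SELECT m.id, m.title
FROM mail m
WHERE EXISTS (
    SELECT 1
    FROM mail_zip mz
    JOIN geographies g ON g.postal_code = mz.zip
    WHERE mz.mail_id = m.id
      AND g.city_id = 36
)
ORDER BY m.sent_on DESC
LIMIT 10;
The EXISTS also makes the DISTINCT unnecessary, since each mail row can be returned at most once.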