Efficiently querying a 15,000,000-row table in MySQL
Consider the following database tables:
- Table "messages" with 13,000,000 rows (one row per message).
- Table "users" with 3,000,000 rows (one row per user).
The following query is used to fetch a bunch of messages and the corresponding users:
SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id=users.id
WHERE messages.id in (?, ?, ?, ? ... a total of 100 "?":s);
100 messages are fetched in each query.
"messages" is indexed on id (primary key, BIGINT not auto-generated) and user_id.
"users" is indexed on id (primary key, INT auto-generated).
The database is MySQL using MyISAM.
Currently the query takes well over 3000 ms to execute which puzzles me since "messages" is indexed on "id", so retrieving the correct rows should be very quick.
My question is: Given the described scenario and setup, is a 3000 ms query time "normal", or am I missing something? Please let me know if further details are required.
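As a first diagnostic step, the optimizer's plan for the query can be inspected with EXPLAIN. A minimal sketch, with literal IDs standing in for the 100 placeholders:

```sql
-- The literal IDs below are placeholders for the 100 "?" parameters
EXPLAIN SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id = users.id
WHERE messages.id IN (1, 2, 3);
```

If the plan shows anything other than a `range` access on the `messages` primary key and an `eq_ref` lookup into `users`, the optimizer is not using the indexes as expected.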
Update #1: Here are the table definitions:
CREATE TABLE messages (
  id bigint(20) NOT NULL DEFAULT '0',
  user_id int(11) NOT NULL DEFAULT '0',
  message varchar(160) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY user_id (user_id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE users (
  id int(11) NOT NULL DEFAULT '0',
  username varchar(32) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  UNIQUE KEY username (username)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The only "non-standard" thing I observe in the definitions is that "messages.id" is a BIGINT rather than an INT. Could that be a hint?
7 Answers
I've worked on MyISAM tables with billions of rows, and one of the things I found past some row-count limit was that the optimizer took far too long to decide how to approach a query, and wrongly performed some table scans. I can't find the exact page describing it, but I started always using FORCE INDEX on each segment of queries where I knew how it should request objects:
http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
The fact of the matter is, if you're using tables that large, you need to design each and every query to work with your indices, so there's nothing wrong with forcing an index. It will still scan the table if it has to, but FORCE INDEX tells it not to unless it absolutely has to.
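Applied to the query in question, the hint might look like this; a sketch only, where the choice of the PRIMARY indexes and the literal IDs are assumptions:

```sql
-- Force the optimizer to use the primary keys rather than
-- deciding on its own (which, per the above, it can get wrong)
SELECT messages.id, messages.message, users.id, users.username
FROM messages FORCE INDEX (PRIMARY)
INNER JOIN users FORCE INDEX (PRIMARY) ON messages.user_id = users.id
WHERE messages.id IN (1, 2, 3);  -- placeholders for the 100 "?" parameters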
Also, if your tables are large, I assume your indices are also large. You absolutely need to make sure you have proper configuration settings, that your key_buffer is an adequate size, and that you have enough I/O. If you're running 32-bit MySQL (which you shouldn't be), then set your key_buffer as large as 1 GB (assuming you have 1 GB to spare) and check its usage with 'mysqlreport'.
If you're running 64-bit MySQL, make it as large as you can while still leaving space for the OS to cache files and for whatever other applications you have running, so maybe a couple of GB if you can.
Even if your queries use indices, if the index can't be properly buffered in memory you're still hitting the disk, and there's a performance hit proportional to the index size and the speed of the disk / available I/O.
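A sketch of how to check and raise the key buffer described above (the 1 GB value is the example figure from the answer, not a general recommendation):

```sql
-- Current setting and how heavily the key cache is being used
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW STATUS LIKE 'Key_%';

-- Raise it to 1 GB at runtime (assumes 1 GB of memory to spare);
-- make it permanent with key_buffer_size = 1G in my.cnf
SET GLOBAL key_buffer_size = 1073741824;
```

A high ratio of `Key_reads` (disk) to `Key_read_requests` (cache) in the status output suggests the buffer is too small for the indexes.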
As far as int vs big int, the only noticeable performance difference I've seen is in performing calculations on big ints, like SUM. SUM is appreciably slower on big int than on int, so much so that I'd look at storing numbers in a different magnitude, or separating them out to two ints if you need to perform frequent calculations on them.
It seems that your messages have datatype TEXT and are long. Long TEXT columns are stored out-of-row, which is why you'll need some extra page reads to retrieve them, and that may take long. Could you please check two things:
Performance of this query:
Well, the query and table design themselves are likely not the cause, though the query could use some help (like adding the "in list" to the join predicate to eliminate a late-side filter, though I'd guess the optimizer returns the same plan regardless).
My guess is this is a symptom of other issues: index/table fragmentation or out-of-date stats. Are these tables often deleted from? It might help to defragment the tables and indexes; otherwise you may be a victim of pages that are only 10% or less full, which would cause a lot of disk I/O.
Note: with an integer seed for a primary key, you don't often see a lot of fragmentation unless you get a lot of deletes and updates to rows.
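The defragmentation and statistics refresh suggested above can be done with standard statements; a sketch (note that on MyISAM these lock the tables while they run):

```sql
-- Rebuild the tables to compact fragmented data and index pages
OPTIMIZE TABLE messages;
OPTIMIZE TABLE users;

-- Refresh the index statistics the optimizer uses to pick a plan
ANALYZE TABLE messages;
ANALYZE TABLE users;
```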
Every time, or just the first query? Could it be that the first query is incurring the cost of loading the indices, etc.?
And for comparison, how long does it take to do the same query for a specific message ID?
Also, depending on the spec of the box you're running this on, as well as looking at the execution plans as suggested by others, it may be worth looking at the memory usage of mysqld and making sure that it's not simply swapping.
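The single-ID comparison suggested above would be the same query with the IN list reduced to one value; a sketch, where 12345 is a placeholder ID:

```sql
-- Time this against the 100-ID version; if it is fast every time,
-- the problem is in the IN-list handling rather than the indexes
SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id = users.id
WHERE messages.id = 12345;
```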
Since this is usually re-written by the parser as:
I'd be interested to see the execution plan and the performance for a single case:
In which case you might be better off doing a UNION or creating a table containing the IDs and performing a JOIN.
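The "table containing the IDs" approach mentioned above can be sketched as follows (the `wanted_ids` name and the literal IDs are illustrative):

```sql
-- Stage the 100 IDs in a temporary table, then join against it,
-- instead of passing a long IN list
CREATE TEMPORARY TABLE wanted_ids (id BIGINT NOT NULL PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (1), (2), (3);  -- ...up to 100 IDs

SELECT m.id, m.message, u.id, u.username
FROM wanted_ids w
INNER JOIN messages m ON m.id = w.id
INNER JOIN users u ON u.id = m.user_id;
```

This gives the optimizer an ordinary join to plan rather than a 100-element predicate, which can produce a more stable plan on large tables.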
What hardware are you looking at here? I assume you have a server with a reasonable amount of RAM and key_buffer set quite large (say, bigger than the combined index sizes of the two modestly sized tables). I assume the server is an otherwise-idle performance-test server.
Can you measure the amount of I/O?
If you repeat the exact same query, is it fast?
If you load the entire database onto a RAM disk (a small table with only 15M rows is going to fit on a RAM disk quite easily), is it faster?
Also (as others have noted), post the EXPLAIN plan.
But such a small database should always be fast, as it will fit in RAM on all but the puniest of servers.