Efficiently querying a 15,000,000-row table in MySQL

Published 2024-07-26


Consider the following database tables:

  • Table "messages" with 13,000,000 rows (one row per message).
  • Table "users" with 3,000,000 rows (one row per user).

The following query is used to fetch a bunch of messages and the corresponding users:

SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id=users.id 
WHERE messages.id in (?, ?, ?, ? ... a total of 100 "?":s);

100 messages are fetched in each query.
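Building an IN (...) list of exactly this shape is usually done by generating one placeholder per id. A minimal sketch using Python's stdlib sqlite3 as a stand-in for the MySQL tables (the schema and data below are illustrative, not the real ones):

```python
import sqlite3

# In-memory stand-in for the MySQL schema (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE);
CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                       message TEXT NOT NULL);
CREATE INDEX idx_messages_user_id ON messages (user_id);
""")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")
conn.executemany("INSERT INTO messages VALUES (?, ?, ?)",
                 [(i, 1 + i % 2, f"msg {i}") for i in range(1, 201)])

def fetch_messages(ids):
    # One "?" per id, so the query stays fully parameterized.
    placeholders = ", ".join("?" * len(ids))
    sql = f"""
        SELECT messages.id, messages.message, users.id, users.username
        FROM messages
        INNER JOIN users ON messages.user_id = users.id
        WHERE messages.id IN ({placeholders})
    """
    return conn.execute(sql, ids).fetchall()

rows = fetch_messages(list(range(1, 101)))  # 100 ids per query
print(len(rows))
```

The placeholder string is rebuilt per call, so the same function works for any batch size.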

"messages" is indexed on id (primary key, BIGINT not auto-generated) and user_id.

"users" is indexed on id (primary key, INT auto-generated).

The database is MySQL using MyISAM.

Currently the query takes well over 3000 ms to execute which puzzles me since "messages" is indexed on "id", so retrieving the correct rows should be very quick.

My question is: Given the described scenario and setup, is a 3000 ms query time "normal", or am I missing something? Please let me know if further details are required.

Update #1: Here are the table definitions:

CREATE TABLE messages (
  id bigint(20) NOT NULL DEFAULT '0',
  user_id int(11) NOT NULL DEFAULT '0',
  message varchar(160) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY user_id (user_id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE users (
  id int(11) NOT NULL DEFAULT '0',
  username varchar(32) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  UNIQUE KEY username (username)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

The only "non-standard" thing I observe in the definitions is that "messages.id" is a BIGINT rather than an INT. Could that be a hint?


Comments (7)

爱人如己 2024-08-02 11:52:56


I've worked on MyISAM tables with billions of rows, and one of the things I found after some row-count limit was that the optimizer took far too long to decide how to approach a query, and wrongly performed some table scans. I can't find the exact page describing it, but I started always using FORCE INDEX on each segment of queries where I knew how it should request objects:

http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

The fact of the matter is, if you're using tables that large, you need to design each and every query to work with your indices, so there's nothing wrong with forcing an index. It will still scan the table if it has to, but FORCE INDEX tells it not to unless it absolutely must.
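MySQL's hint is spelled FORCE INDEX (see the index-hints manual page linked above). As a runnable stand-in, SQLite's INDEXED BY clause pins a query to a named index in the same spirit; the schema and names below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                       message TEXT NOT NULL);
CREATE INDEX idx_user ON messages (user_id);
""")
conn.executemany("INSERT INTO messages VALUES (?, ?, ?)",
                 [(i, i % 10, f"m{i}") for i in range(1000)])

# MySQL:  SELECT ... FROM messages FORCE INDEX (user_id) WHERE user_id = 3
# SQLite analog: INDEXED BY fails the query rather than fall back to a scan.
rows = conn.execute(
    "SELECT id FROM messages INDEXED BY idx_user WHERE user_id = ?", (3,)
).fetchall()
print(len(rows))
```

Unlike FORCE INDEX, SQLite's INDEXED BY raises an error if the named index cannot be used, which makes it a useful guard in tests.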

Also, if your tables are large, I assume your indices are also large. You absolutely need to make sure you have proper configuration settings, that your key_buffer is an adequate size, and that you have enough I/O. If you're running 32-bit MySQL (which you shouldn't be), then make your key_buffer as large as 1 GB (assuming you have 1 GB to spare) and check its usage with 'mysqlreport'.

If you're running 64-bit MySQL, make it as large as you can while still leaving room for the OS to cache files and for whatever other applications you have running, so maybe a couple of GB if you can.

Even if your queries use indices, if the index can't be properly buffered in memory you're still hitting the disk, and there's a performance hit proportional to the index size and the speed of the disk / available I/O.

As far as INT vs BIGINT goes, the only noticeable performance difference I've seen is in performing calculations on BIGINTs, like SUM. SUM is appreciably slower on BIGINT than on INT, so much so that I'd look at storing numbers at a different magnitude, or splitting them into two INTs, if you need to perform frequent calculations on them.

っ〆星空下的拥抱 2024-08-02 11:52:56
  1. We NEED that EXPLAIN.
  2. MyISAM offers poor concurrency. Consider that concurrent inserts may be causing you headaches. With such a large database, InnoDB may be the direction to go.
  3. If messages are being inserted and deleted, this could cause things to skew if your tables aren't occasionally optimized. Also, MyISAM primary keys aren't clustered. Again, with such a large database, InnoDB may be the direction to go.
溺ぐ爱和你が 2024-08-02 11:52:56
SELECT  messages.id, messages.message, users.id, users.username
FROM    messages
INNER JOIN
        users
ON      users.id = messages.user_id
WHERE   messages.id in (?, ?, ?, ? ... a total of 100 "?":s);

It seems that your messages have datatype TEXT and are long.

Long TEXT columns are stored out-of-row, that's why you'll need to make some extra page reads to retrieve them which may take long.

Could you please check two things:

  1. Performance of this query:

    SELECT  messages.id, users.id, users.username
    FROM    messages
    INNER JOIN
            users
    ON      users.id = messages.user_id
    WHERE   messages.id in (?, ?, ?, ? ... a total of 100 "?":s);
    
  2. Execution plans generated by this query and your original query.
记忆で 2024-08-02 11:52:56


Well, the query and table design themselves are likely not the cause. The query could use some help (like adding the "in list" to the join predicate to eliminate a late filter), though I'd guess the optimizer returns the same plan regardless.

My guess is this is a symptom of other issues: index/table fragmentation or out-of-date stats. Are these tables often deleted from? It might help to defragment the tables and indexes; otherwise you may be a victim of pages that are only 10% or less full, which causes a lot of disk I/O.

Note: with an integer seed for a primary key, you don't often see a lot of fragmentation unless you get a lot of deletes and updates to rows.
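In MySQL, the defragmentation step this answer alludes to would be OPTIMIZE TABLE (plus ANALYZE TABLE to refresh optimizer statistics). As a runnable illustration of the same idea, the sketch below uses SQLite's rough analogs, VACUUM and ANALYZE, to reclaim the space left by deleted rows:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "frag.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, message TEXT)")
conn.executemany("INSERT INTO messages VALUES (?, ?)",
                 [(i, "x" * 200) for i in range(5000)])
conn.commit()
conn.execute("DELETE FROM messages WHERE id % 2 = 0")  # punch holes in pages
conn.commit()
before = os.path.getsize(path)
conn.execute("VACUUM")   # SQLite analog of MySQL's OPTIMIZE TABLE
conn.execute("ANALYZE")  # SQLite analog of MySQL's ANALYZE TABLE
conn.commit()
after = os.path.getsize(path)
print(after < before)  # the file shrinks once the free pages are reclaimed
```

The same pattern applies to MyISAM: periodic OPTIMIZE TABLE compacts the data file and rebuilds indexes after heavy delete/update churn.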

无声情话 2024-08-02 11:52:56


"Currently the query takes well over 3000 ms to execute"

Every time, or just the first query? Could it be that the first query is incurring the cost of loading the indices etc?

And for comparison, how long does it take to do the same query for a specific message ID?

Also, depending on the spec of the box you're running this on, as well as looking at the execution plans as suggested by others, it may be worth looking at the memory usage of mysqld and making sure that it's not simply swapping.

回忆凄美了谁 2024-08-02 11:52:56


Since this is usually re-written by the parser as:

SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id=users.id 
WHERE messages.id = ?
OR messages.id = ?
OR messages.id = ? etc.
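The claimed equivalence is easy to sanity-check. The sketch below uses SQLite as a stand-in; it does not show MySQL's actual parser rewrite, only that both forms return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, message TEXT)")
conn.executemany("INSERT INTO messages VALUES (?, ?)",
                 [(i, f"m{i}") for i in range(1, 11)])

ids = [2, 5, 9]
# IN-list form, as in the original query.
in_rows = conn.execute(
    "SELECT id FROM messages WHERE id IN (?, ?, ?) ORDER BY id", ids
).fetchall()
# Equivalent OR chain, as in the parser rewrite above.
or_rows = conn.execute(
    "SELECT id FROM messages WHERE id = ? OR id = ? OR id = ? ORDER BY id", ids
).fetchall()
print(in_rows == or_rows)
```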

I'd be interested to see the execution plan and the performance for a single case:

SELECT messages.id, messages.message, users.id, users.username
FROM messages
INNER JOIN users ON messages.user_id=users.id 
WHERE messages.id = ?

In which case you might be better off doing a UNION or creating a table containing the IDs and performing a JOIN.
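A sketch of the temp-table alternative, again with SQLite standing in for MySQL (the `wanted` table and the data are made up for illustration): load the ids into a keyed temporary table, then JOIN against it instead of using a long IN list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                       message TEXT NOT NULL);
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.executemany("INSERT INTO messages VALUES (?, 1, ?)",
                 [(i, f"m{i}") for i in range(1, 501)])

# Load the wanted ids into a keyed temp table, then JOIN instead of IN (...).
conn.execute("CREATE TEMP TABLE wanted (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)",
                 [(i,) for i in range(1, 101)])
rows = conn.execute("""
    SELECT m.id, m.message, u.id, u.username
    FROM wanted w
    JOIN messages m ON m.id = w.id
    JOIN users u ON u.id = m.user_id
""").fetchall()
print(len(rows))
```

With the ids in an indexed table, the optimizer can drive the join from `wanted` with one primary-key lookup per id rather than evaluating a 100-term predicate per row.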

允世 2024-08-02 11:52:56


What hardware are you looking at here? I assume you have a server with a reasonable amount of RAM and key_buffer set quite large (say, bigger than the combined index sizes of the two modestly sized tables). I assume the server is an otherwise idle performance-test server.

Can you measure the number of I/Os?

If you repeat the exact same query, is it fast?

If you load the entire database onto a RAM disk (a small table with only 15M rows will fit on a RAM disk quite easily), is it faster?

Also (as others have noted), post the EXPLAIN plan.

But such a small database should always be fast, as it will fit in RAM on all but the puniest of servers.
