SQL query: inner join optimization between large tables
I have the 3 following tables in a MySQL 4.x DB :
- hosts: (300.000 records)
- id (UNSIGNED INT) PRIMARY KEY
- name (VARCHAR 100)
- paths: (6.000.000 records)
- id (UNSIGNED INT) PRIMARY KEY
- name (VARCHAR 100)
- urls: (7.000.000 records)
- host (UNSIGNED INT) PRIMARY KEY <--- links to hosts.id
- path (UNSIGNED INT) PRIMARY KEY <--- links to paths.id
As you can see, the schema is really simple but the problem is the amount of data in these tables.
Here is the query I'm running :
SELECT CONCAT(H.name, P.name)
FROM hosts AS H
INNER JOIN urls as U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;
This query works perfectly fine, but takes 50 minutes to run. Does anyone have any idea about how I could speed up that query?
Thanks in advance.
Nicolas
Perhaps you should include a WHERE clause? Or do you really need ALL the data?
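For illustration, a filtered version of the original query might look like this (the `WHERE` condition is purely hypothetical):

```sql
-- Hypothetical filter: fetch URLs for a single host instead of all 7 million rows.
SELECT CONCAT(H.name, P.name)
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id
WHERE H.name = 'example.com';
```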
This looks to me like a case where over-zealous use of surrogate keys is slowing you down. If the tables were:
hosts :
paths :
urls :
Then your query would require no joins at all:
True, table URLS would occupy more disk space - but does that matter?
EDIT: On second thoughts, what is the point of that PATHS table anyway? How often do different hosts share the same paths?
Why not:
hosts :
urls :
EDIT2: Or if you really need the surrogate key for hosts:
hosts :
urls :
SELECT CONCAT(H.name, U.path) FROM urls U
JOIN hosts H ON H.id = U.host;
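The table definitions were lost from this answer; here is a sketch of what the EDIT2 schema presumably looks like, reconstructed from the query above (column types and sizes are assumptions):

```sql
-- Reconstruction of the suggested denormalised schema: keep the surrogate
-- key for hosts, drop the paths table, and store the path text inline.
CREATE TABLE hosts (
  id   INT UNSIGNED NOT NULL PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE urls (
  host INT UNSIGNED NOT NULL,   -- links to hosts.id
  path VARCHAR(100) NOT NULL,   -- path text stored directly, no paths join needed
  PRIMARY KEY (host, path)
);
```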
For one thing I wouldn't do the CONCAT in the query. Do it outside.
But really your query runs slowly because you're retrieving millions of rows.
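Concretely, that means retrieving the two columns separately and joining the strings in application code:

```sql
-- Same query without the server-side CONCAT; glue the strings together client-side.
SELECT H.name, P.name
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;
```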
Overall, the best advice is to trace and profile to see what is really taking up time. But here are my thoughts about specific things to look at.
(1) I would say that you want to ensure that indexes are NOT used in the execution of this query. Since you have no filtering conditions, it should be more efficient to full-scan all the tables and then join them together with a sort-merge or hash operation.
(2) The string concatenation is surely taking some time, but I don't understand why people are recommending to remove it. You would presumably then need to do the concatenation in another piece of code, where it would still take about the same amount of time (unless MySQL's string concatenation is particularly slow for some reason).
(3) The data transferral from the server to the client is probably taking significant time, quite possibly more than the time the server needs to fetch the data. If you have tools to trace this sort of thing, use them. If you can increase the fetch array size in your client, experiment with different sizes (e.g. in JDBC use Statement.setFetchSize() ). This can be significant even if the client and server are on the same host.
I'd try to create a new table with the data you want to get. Doing this means that you lose some real data, but you gain speed. Could this idea be similar to OLAP or something like that?
Of course, you have to do an update (daily or whatever) of this table.
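A minimal sketch of such a pre-computed table, using MySQL's `CREATE TABLE ... SELECT` syntax (the table and column names are made up for illustration):

```sql
-- Rebuild this summary table daily (or on whatever schedule fits).
CREATE TABLE url_list AS
SELECT CONCAT(H.name, P.name) AS full_url
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;
```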
I'm no MySQL expert, but it looks like MySQL primary keys are clustered -- you'll want to make sure that's the case with your primary keys; clustered indexes will definitely help speed things up.
One thing, though -- I don't believe you can have two "primary" keys on any table; your urls table looks rather suspect to me for that reason. Above all, you should make absolutely sure those two columns in the urls table are indexed to the hilt -- a single numeric index on each one should be fine -- because you're joining on them, so the DBMS needs to know how to find them quickly; that could be what's going on in your case. If you're full-table-scanning that many rows, then yes, you could be sitting there for quite some time while the server tries to find everything you asked for.
I'd also suggest removing that CONCAT function from the select statement, and seeing how that affects your results. I'd be amazed if that weren't a contributing factor somehow. Just retrieve both columns and handle the concatenation afterward, and see how that goes.
Lastly, have you figured out where the bottleneck is? Just joining on three several-million-row tables shouldn't take much time at all (I'd expect maybe a second or so, just eyeballing your tables and query), provided the tables are properly indexed. But if you're pushing those rows over a slow or already-pegged NIC, to a memory-starved app server, etc., the slowness could have nothing to do with your query at all, but instead with what happens after the query. Seven million rows is quite a bit of data to be assembling and moving around, regardless of how long the finding of those rows happens to take. Try selecting just one row instead, rather than all seven million, and see how that looks by contrast. If that's fast, then the problem isn't the query, it's the result set.
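A sketch of the indexing this answer suggests, assuming the two "primary key" columns on urls are really meant to be one composite key (all statements are illustrative, not from the original post):

```sql
-- One composite primary key instead of two separate "primary" keys...
ALTER TABLE urls ADD PRIMARY KEY (host, path);
-- ...plus a plain index on path, so the join to paths.id can also use an index.
ALTER TABLE urls ADD INDEX idx_urls_path (path);
```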
As your result set returns all data, there is very little optimisation that can be done at all. You're scanning the whole table, then joining on other tables that have indexes.
Are the PrimaryKeys Clustered? This ensures that the data is stored on the disk in the index order, so avoiding bouncing around different parts of the disk.
Also, you can have the data spread over multiple disks. If you have URLs on PRIMARY and PATHS/HOSTS on SECONDARY then you'll get better throughput from the drives.
You need to look at your server configuration. The default memory parameters for MySQL will cripple performance on a table that size. If you are using the defaults, you need to raise at least key_buffer_size and join_buffer_size by at least a factor of 4, perhaps much more. Look in the documentation; there are other memory parameters you can tweak.
MySQL has a funny performance quirk where, if your tables go over a certain size with queries that will return most of the data, performance goes into the toilet. Unfortunately, it has no way of telling you when that threshold is reached. It looks to me like you have, though.
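For example, a `my.cnf` fragment with the two parameters raised (the values are illustrative, not a recommendation; tune them to your available RAM):

```
[mysqld]
key_buffer_size  = 256M   # index cache; the MySQL 4.x default is only a few MB
join_buffer_size = 4M     # per-join buffer used for joins without usable indexes
```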
Try optimizing your tables before you run the query:
It might save you some time, especially if rows have been deleted from the tables.
(see here for more information on OPTIMIZE)
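The statement itself was lost from this answer; presumably it was along these lines:

```sql
-- Rebuilds table data and sorts index pages (useful after many deletes on MyISAM).
OPTIMIZE TABLE hosts, paths, urls;
```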
Have you already declared some indexes on the join-attributes?
PS: See here[broken link] for indexes on MySQL 4.x
The concat is definitely slowing you down. Can we see the results of a mysql explain on this? Documentation Link
The biggest thing to do is to try and pull only the data you need though. If you can pull fewer records that will speed you up as much as anything. But a mysql explain should help us see if any indexes would help.
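For reference, running EXPLAIN against the original query looks like this; the output shows, per table, which index (if any) each join uses and roughly how many rows are examined:

```sql
EXPLAIN
SELECT CONCAT(H.name, P.name)
FROM hosts AS H
INNER JOIN urls AS U ON H.id = U.host
INNER JOIN paths AS P ON U.path = P.id;
```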
I understand that you want a complete list of urls - which is 7 million records.
Perhaps, as suggested by Mitch, you should consider using the WHERE clause to filter your results.
Perhaps the timing is mainly related to the delay in displaying the records. Check the time for the query itself first. If it is still slow, I would go and check the timing for
select count(*) from urls
and then for each join in turn, just to locate the source of the slowdown.
Also, sometimes reordering your query can help.
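The individual timing queries were lost from this answer; a plausible reconstruction of the step-by-step approach might be (these exact statements are guesses):

```sql
-- Step 1: how fast is a bare count of the big table?
SELECT COUNT(*) FROM urls;

-- Step 2: add one join at a time to see which one hurts.
SELECT COUNT(*)
FROM urls AS U INNER JOIN hosts AS H ON H.id = U.host;

SELECT COUNT(*)
FROM urls AS U INNER JOIN paths AS P ON U.path = P.id;
```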
I can't say for sure about mySQL but I know in SQL Server that primary keys create an index automatically but foreign keys do not. Make sure to check that there is an index on your foreign key fields.
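Checking for and adding such indexes in MySQL might look like this (the index names are made up):

```sql
SHOW INDEX FROM urls;                        -- see what indexes already exist
CREATE INDEX idx_urls_host ON urls (host);   -- foreign key to hosts.id
CREATE INDEX idx_urls_path ON urls (path);   -- foreign key to paths.id
```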
Since I am not a big MySQL fan, I would ask if you have tried PostgreSQL. In that DB, you would want to make sure that your work_mem setting was quite high, but you can set it per DB connection with SET work_mem = 64MB, for example.
Another suggestion is to look into using duplicate path entries. There are many URLs that share paths.
Another thing that might or might not help is using fixed-length text fields instead of varchars. It used to make a speed difference but I'm not sure about current DB engines.
If you do use PostgreSQL it will let you use JOIN USING but even on MySQL I like it more: name your id field the same in every table. Instead of id in hosts and host in urls, name it host_id both places.
Now some more commentary. :)
This data layout you have here is very useful when you are selecting a small set of rows, perhaps every URL from the same domain. It can also help a lot if your queries often need to do sequential scans of the urls table for other data stored there, because the scan can skip over the large text fields (unless it doesn't matter because your DB stores text via pointers to a linked table anyway).
However, if you almost always select all the domain and path data, then it makes more sense to store it in one table.
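Putting this answer's suggestions together in PostgreSQL syntax (assuming the renamed host_id column the answer proposes, and omitting the paths join for brevity):

```sql
SET work_mem = '64MB';          -- per-connection sort/hash memory, as suggested above

-- JOIN USING works when the join column has the same name in both tables.
SELECT h.name || u.path AS full_url
FROM urls AS u
JOIN hosts AS h USING (host_id);
```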