Is there anything I can do to optimize this MySQL query?
I am hoping some of you who are experts in mysql can help me to optimize my mysql search query...
First, some background:
I am working on a small exercise mysql application that has a search feature.
Each exercise in the database can belong to an arbitrary number of nested categories, and each exercise can also have an arbitrary number of searchtags associated with it.
Here is my data structure (simplified for readability)
TABLE exercises
    ID
    title

TABLE searchtags
    ID
    title

TABLE exerciseSearchtags
    exerciseID -> exercises.ID
    searchtagID -> searchtags.ID

TABLE categories
    ID
    parentID -> ID
    title

TABLE exerciseCategories
    exerciseID -> exercises.ID
    categoryID -> categories.ID
All tables are InnoDB (no full-text searching).
The ID columns for exercises, searchtags and categories have been indexed.
"exerciseSearchtags" and "exerciseCategories" are many-to-many join tables expressing the relationship between exercises and searchtags, and exercises and categories, respectively. Both the exerciseID and searchtagID columns have been indexed in exerciseSearchtags, and both the exerciseID and categoryID columns have been indexed in exerciseCategories.
Here are some examples of what exercise title, category title and searchtag title data might look like. All three types can have multiple words in the title.
Exercises
(ID - title)
1 - Concentric Shoulder Internal Rotation in Prone
2 - Straight Leg Raise Dural Mobility (Sural)
3 - Push-Ups
Categories
(ID - title)
1 - Flexion
2 - Muscles of Mastication
3 - Lumbar Plexus
Searchtags
(ID - title)
1 - Active Range of Motion
2 - Overhead Press
3 - Impingement
Now, on to the search query:
The search engine accepts an arbitrary number of user inputted keywords.
I would like to rank search results based on the number of keyword/category title matches, keyword/searchtag title matches, and keyword/exercise title matches.
To accomplish this, I am using the following dynamically generated SQL:
SELECT
    exercises.ID AS ID,
    exercises.title AS title,
    (
        -- for each keyword, the following
        -- 3 subqueries are generated
        (
            SELECT COUNT(1)
            FROM categories
            LEFT JOIN exerciseCategories
                ON exerciseCategories.categoryID = categories.ID
            WHERE categories.title RLIKE CONCAT('[[:<:]]',?)
                AND exerciseCategories.exerciseID = exercises.ID
        ) +
        (
            SELECT COUNT(1)
            FROM searchtags
            LEFT JOIN exerciseSearchtags
                ON exerciseSearchtags.searchtagID = searchtags.ID
            WHERE searchtags.title RLIKE CONCAT('[[:<:]]',?)
                AND exerciseSearchtags.exerciseID = exercises.ID
        ) +
        (
            SELECT COUNT(1)
            FROM exercises AS exercises2
            WHERE exercises2.title RLIKE CONCAT('[[:<:]]',?)
                AND exercises2.ID = exercises.ID
        )
        -- end subqueries
    ) AS relevance
FROM
    exercises
    LEFT JOIN exerciseCategories
        ON exerciseCategories.exerciseID = exercises.ID
    LEFT JOIN categories
        ON categories.ID = exerciseCategories.categoryID
    LEFT JOIN exerciseSearchtags
        ON exerciseSearchtags.exerciseID = exercises.ID
    LEFT JOIN searchtags
        ON searchtags.ID = exerciseSearchtags.searchtagID
WHERE
    -- for each keyword, the following
    -- 3 conditions are generated
    categories.title RLIKE CONCAT('[[:<:]]',?) OR
    exercises.title RLIKE CONCAT('[[:<:]]',?) OR
    searchtags.title RLIKE CONCAT('[[:<:]]',?)
    -- end conditions
GROUP BY
    exercises.ID
ORDER BY
    relevance DESC
LIMIT
    $start, $results
All of this works just fine. It returns relevant search results based on user input.
However, I am worried that my solution may not scale well. For example, if a user enters a seven-keyword search string, that will result in a query with 21 subqueries in the relevance calculation, which might start to slow things down if the tables get big.
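As an illustration of that growth, the relevance expression could be assembled roughly as below. This is only a sketch: the use of Python and the names `SUBQUERY_TEMPLATES` and `build_relevance_sql` are assumptions made for the example, not the application's actual code. It shows one correlated subquery being emitted per keyword per searched table.

```python
# Hypothetical sketch of how the generated relevance expression grows.
# One correlated subquery is produced per keyword for each of the three
# searched tables (categories, searchtags, exercises).

SUBQUERY_TEMPLATES = (
    "(SELECT COUNT(1) FROM categories"
    " LEFT JOIN exerciseCategories"
    " ON exerciseCategories.categoryID = categories.ID"
    " WHERE categories.title RLIKE CONCAT('[[:<:]]', ?)"
    " AND exerciseCategories.exerciseID = exercises.ID)",
    "(SELECT COUNT(1) FROM searchtags"
    " LEFT JOIN exerciseSearchtags"
    " ON exerciseSearchtags.searchtagID = searchtags.ID"
    " WHERE searchtags.title RLIKE CONCAT('[[:<:]]', ?)"
    " AND exerciseSearchtags.exerciseID = exercises.ID)",
    "(SELECT COUNT(1) FROM exercises AS exercises2"
    " WHERE exercises2.title RLIKE CONCAT('[[:<:]]', ?)"
    " AND exercises2.ID = exercises.ID)",
)


def build_relevance_sql(keywords):
    """Return (sql_expression, parameters) for the relevance column."""
    parts, params = [], []
    for keyword in keywords:
        for template in SUBQUERY_TEMPLATES:
            parts.append(template)
            params.append(keyword)  # one bound parameter per subquery
    return "(" + " + ".join(parts) + ") AS relevance", params


seven = ["shoulder", "press", "flexion", "leg", "raise", "push", "lumbar"]
sql, params = build_relevance_sql(seven)
print(sql.count("SELECT COUNT(1)"), len(params))
```

The subquery count is always three times the keyword count, and the WHERE clause grows by the same number of conditions.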
Does anyone have any suggestions as to how I can optimize the above? Is there a better way to accomplish what I want? Am I making any glaring errors in the above?
Thanks in advance for your help.
Comments (2)
I might be able to provide a better answer if you also provided some data, particularly some example keywords and example titles from each of your tables, so we can get a sense of what you're actually trying to match on. But I will try to answer with what you have provided.
First let me put in English what I think your query will do, and then I'll break down the reasons why and ways to fix it.
Assuming that you have at least a few sane indexes, this will work out to be E x C x (C + S + 1) rows scanned, where E is the number of exercises, C is the average number of categories for a given exercise, and S is the average number of search tags for a given exercise. If you don't have indexes on at least the IDs you listed, then it will perform far worse. So part of the question depends particularly on the relative sizes of C and S, which I can currently only guess at. If E is 1000 and C and S are each about 2-3, then you'll be scanning 10,000-21,000 rows. If E is 1 million, C is 2-3 and S is 10-15, you'll be scanning 26-57 million rows. If E is 1 million and C or S is about 1000, then you'll be scanning well over 1 trillion rows. So no, this won't scale well at all.
1) The LEFT JOINs inside your subqueries are effectively ignored, because the WHERE clauses on those same subqueries force them to behave as normal (inner) JOINs. This doesn't affect performance much, but it does obfuscate your intent.
2) RLIKE (and its alias REGEXP) never utilize indexes AFAIK, so they will never scale. I can only guess without sample data, but I would say that if your searches require matching on word boundaries, then you need to normalize your data. Even if your titles seem like natural strings to store, searching through part of them means you're really treating them as a collection of words. So you should either make use of mysql's full text search capabilities, or break your titles out into separate tables that store one word per row. One row per word will obviously increase your storage, but it would make your queries almost trivial, since you appear to be doing only whole word matches (as opposed to similar words, word roots, etc).
3) The final left joins you have are what cause the E x C part of my formula: you will be doing the same work C times for every exercise. Now, admittedly, under most query plans the subqueries will be cached for each category, so in practice it's not quite as bad as I'm suggesting, but that will not be true in every case, so I'm giving you the worst-case scenario. Even if you could verify that you have the proper indexes in place and the query optimizer has avoided all those extra table scans, you will still be returning lots of redundant data, because each exercise row is duplicated for each exerciseCategories entry even though you're not returning anything from exerciseCategories or categories (and the categories.ID in your first subquery is actually referencing the categories joined in that subquery, NOT the one from the outer query).
4) Since most search engines return results using paging, I would guess you only really need the first X results. Adding a LIMIT X to your query, or better yet LIMIT Y, X, where Y is the row offset of the current page and X is the number of results returned per page, will greatly help optimize your query if the search keywords return lots of results.
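To get a feel for how quickly the E x C x (C + S + 1) estimate grows, it can be evaluated for a few table sizes. This is just the arithmetic of the formula above written as a small Python sketch; the function name `estimated_rows` is made up for the example, and the result is a rough worst-case estimate, not a measurement.

```python
def estimated_rows(e, c, s):
    """Worst-case rows scanned under the E x C x (C + S + 1) estimate."""
    return e * c * (c + s + 1)


# (E, C, S) combinations taken from the ranges discussed above.
scenarios = [
    (1_000, 2, 2),
    (1_000, 3, 3),
    (1_000_000, 2, 10),
    (1_000_000, 3, 15),
    (1_000_000, 1_000, 1_000),
]

for e, c, s in scenarios:
    print(f"E={e:,} C={c:,} S={s:,} -> {estimated_rows(e, c, s):,} rows")
```

Because C appears twice in the product, the number of categories per exercise is what makes the estimate blow up fastest.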
If you can provide us with a little more information on your data, I can update my answer to reflect that.
UPDATE
Based on your responses, here is my suggested query. Unfortunately, without full text search or indexed words, there are still going to be scaling problems if either your category table or your search tag table is very large.
HAVING relevance > 0
ORDER BY relevance DESC
LIMIT $start, $results
I wouldn't normally recommend a HAVING clause, but it's not going to be any worse than your RLIKE ... OR RLIKE ..., etc.
This addresses my issues #1, #3 and #4 but leaves #2 still remaining. Given your example data, I would imagine that each table only has at most a few dozen entries. In that case, the inefficiency of RLIKE might not be painful enough to be worth the one-word-per-row optimization, but you did ask about scaling. Only an exact equality query (title = ?) or a starts-with query (title LIKE 'foo%') can use indexes, which are an absolute necessity if you are going to scale up the rows in any table. RLIKE and REGEXP don't fit those criteria, no matter the regular expression used (and yours is a 'contains'-like query, which is the worst case). (It's important to note that title LIKE CONCAT(?, '%') is NOT good enough, because mysql sees that it has to calculate something and ignores its index. You need to add the '%' in your application.)
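A minimal sketch of the one-word-per-row idea, assuming a hypothetical `exerciseWords` table and using Python's built-in `sqlite3` module as a stand-in for MySQL so that the example runs on its own. The table name, the index name and the `search` helper are all invented for illustration. Only exercise titles are indexed here, and matching is exact whole-word equality, which is the kind of lookup an ordinary index can serve.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (ID INTEGER PRIMARY KEY, title TEXT);
    -- Hypothetical table: one row per (exercise, word) pair.
    CREATE TABLE exerciseWords (
        exerciseID INTEGER REFERENCES exercises(ID),
        word TEXT
    );
    CREATE INDEX idx_exerciseWords_word ON exerciseWords (word);
""")

titles = {
    1: "Concentric Shoulder Internal Rotation in Prone",
    2: "Straight Leg Raise Dural Mobility (Sural)",
    3: "Push-Ups",
}
for exercise_id, title in titles.items():
    conn.execute("INSERT INTO exercises VALUES (?, ?)", (exercise_id, title))
    # Split on anything that is not a letter or digit; store lowercase,
    # one row per distinct word in the title.
    for word in set(re.findall(r"[a-z0-9]+", title.lower())):
        conn.execute("INSERT INTO exerciseWords VALUES (?, ?)", (exercise_id, word))


def search(keywords):
    """Rank exercises by how many distinct keywords equal a title word."""
    words = sorted({k.lower() for k in keywords})
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    query = f"""
        SELECT e.ID, e.title, COUNT(*) AS relevance
        FROM exerciseWords AS w
        JOIN exercises AS e ON e.ID = w.exerciseID
        WHERE w.word IN ({placeholders})
        GROUP BY e.ID, e.title
        ORDER BY relevance DESC, e.ID
    """
    return conn.execute(query, words).fetchall()


print(search(["leg", "raise", "shoulder"]))
```

The same pattern would extend to category and search tag titles with their own word tables, at the cost of keeping those tables in sync whenever a title changes.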
Try running EXPLAIN on the query and look at the rows of its output that currently do not use an index. Add indexes strategically for those rows.
Also, if possible, reduce the number of RLIKE calls in the query, as those will be expensive.
Consider caching results in front of the database, using something like memcached, to reduce database load.
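The caching idea can be sketched as follows. A plain in-process dictionary stands in for memcached here so that the example is self-contained, and the names `SearchCache` and `fake_query` are hypothetical. A real deployment would also need expiry and invalidation when exercises, categories or search tags change.

```python
import hashlib


class SearchCache:
    """Tiny in-process stand-in for a memcached-style cache (no expiry)."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that actually hits the database
        self._store = {}
        self.misses = 0

    @staticmethod
    def _key(keywords, start, results):
        # Normalise so "Leg raise" and "raise leg" share one cache entry.
        normalised = " ".join(sorted(k.lower() for k in keywords))
        raw = f"{normalised}|{start}|{results}"
        return hashlib.sha1(raw.encode("utf-8")).hexdigest()

    def search(self, keywords, start=0, results=10):
        key = self._key(keywords, start, results)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._run_query(keywords, start, results)
        return self._store[key]


def fake_query(keywords, start, results):
    # Placeholder for the real database call.
    return [f"rows for {sorted(k.lower() for k in keywords)}"]


cache = SearchCache(fake_query)
cache.search(["Leg", "raise"])
cache.search(["raise", "leg"])  # same normalised key, served from the cache
print(cache.misses)
```

Including the page offset and page size in the key keeps each results page cached separately.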