Help me turn a SUBQUERY into a JOIN
Two tables.
emails
id (int10) | ownership (int10)
messages
emailid (int10) indexed | message (mediumtext)
Subquery (which is terrible in MySQL).
SELECT COUNT(*) FROM messages
WHERE message LIKE '%word%' AND
emailid IN (SELECT id FROM emails WHERE ownership = 32)
The usage here is that I run a search on emails (obviously simplified in the sample above) that generates a list of, say, 3,000 email ids. I then want to search messages, because I need to do a text match against the message, but only for those 3,000 emails.
The query against messages is expensive (message is not indexed), but this is fine because it would only ever be checking a few rows.
Ideas:
i) A join. My attempts at this so far have not worked and have resulted in full table scans of the messages table (i.e. the emailid index is not used).
ii) Temporary table. This could work, I think.
iii) Cache ids in the client and run 2 queries. This does work. Not elegant.
iv) Subquery. MySQL subqueries run the 2nd query each time, so this does not work. Maybe fixed in MySQL 6.
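Idea (ii) could be sketched roughly as follows, assuming the emails and messages tables above. The name tmp_email_ids is made up for illustration, INT is a guess at what int10 stands for, and the PRIMARY KEY assumes emails.id is unique.

```sql
-- Sketch only: tmp_email_ids is a hypothetical name.
-- Step 1: materialise the matching email ids once.
CREATE TEMPORARY TABLE tmp_email_ids (
  id INT NOT NULL PRIMARY KEY
);

INSERT INTO tmp_email_ids (id)
SELECT id FROM emails WHERE ownership = 32;

-- Step 2: join the small id list to messages, so the LIKE is only
-- evaluated on rows reached through the emailid index.
SELECT COUNT(*)
FROM tmp_email_ids AS t
JOIN messages AS m ON m.emailid = t.id
WHERE m.message LIKE '%word%';

-- Step 3: clean up (the table would also vanish when the connection closes).
DROP TEMPORARY TABLE tmp_email_ids;
```

This is essentially idea (iii) done on the server side; whether it beats a plain join depends on the plan MySQL picks, so it is worth checking with EXPLAIN.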
OK, here is what I have so far. These are the actual field names (I had simplified a bit in the question).
The query:
SELECT COUNT(*) FROM ticket LEFT JOIN ticket_subject
ON (ticket_subject.ticketid = ticket.id)
WHERE category IN (1)
AND ticket_subject.subject LIKE "%about%"
The results:
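id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | ticket | ref | PRIMARY,category | category | 4 | const | 28874 |
1 | SIMPLE | ticket_subject | eq_ref | PRIMARY | PRIMARY | 4 | deskpro.ticket.id | 1 | Using where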
It takes 0.41 seconds and returns a count(*) of 113.
Running:
SELECT COUNT(*) FROM ticket WHERE category IN (1)
Takes 0.01 seconds and finds 33,000 results.
Running
SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE "%about%"
Takes 0.14 seconds and finds 1,300 results.
Both the ticket table and ticket_subject table have 300,000 rows.
There is an index on ticket_subject.ticketid and ticket.category.
I realise now that using the LIKE syntax was a mistake, as it has been a bit of a red herring about FULLTEXT. This is not the issue. The issue is:
1) Table A - very fast query, run on index. 0.001 seconds
2) Table B - moderate to slow query, no index - does full table scan. 0.1 seconds.
Both of these results are fine. The problem is that I have to JOIN them, and the search then takes 0.3 seconds. To me that makes no sense: the slow part of the combined query, on Table B, should be quicker because we are now searching only a fraction of that table. That is, it should not be doing a full table scan, because the field being joined on is indexed.
Comments (4)
Remember to take advantage of Boolean short-circuit evaluation:
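A minimal sketch of such a query, assuming the emails and messages tables from the question (the aliases e and m are only for illustration):

```sql
-- Sketch only: table and column names are taken from the question.
SELECT COUNT(*)
FROM emails AS e
JOIN messages AS m ON m.emailid = e.id
WHERE e.ownership = 32          -- cheap predicate, listed first
  AND m.message LIKE '%word%';  -- expensive predicate, listed second
```

The optimizer still chooses the join order and access paths itself, so running EXPLAIN on the query is the way to confirm that the emailid index is used for the join.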
This filters by ownership before it evaluates the LIKE predicate. Always put your cheaper expressions on the left.
Also, I agree with @Martin Smith and @MJB that you should consider using MySQL's FULLTEXT indexing to make this faster.
Re your comment and additional information, here's some analysis:
The note "Using index" is a good thing to see because it means it can satisfy the query just by reading the index data structure, not even touching the data of the table. This is certain to run very fast.
This shows that there are no possible keys that can benefit the wildcard LIKE predicate. It uses the condition in the WHERE clause, but it has to evaluate it by running a table-scan.
Likewise, accessing the ticket table is quick, but that's spoiled by the table-scan incurred by the LIKE condition.
You're never going to make LIKE perform well. See my presentation Practical Full-Text Search in MySQL.
Re your comment: Okay, I've done some experiments on a dataset of similar size (the Users and Badges tables in the Stack Overflow data dump :-). Here's what I found:
That's really fast, because I have an index on the reputation column.
That's as expected, since the table has 700k rows, and it has to do a table-scan. Now let's do the join:
That doesn't seem so bad. Here's the explain report:
This does seem like it's using indexes intelligently for the join, and it helps that I have a compound index including userid and reputation. Remember that MySQL generally uses only one index per table in a given query, so it's important to define the right compound indexes for the query you need to do.
Re your comment: OK, I've tried this where reputation > 5000, and where reputation > 500, and where reputation > 50. These should match a much larger set of users.
The explain report is the same in all cases, but if the query finds more matching rows in the Users table, then it naturally has to evaluate the LIKE predicate against a lot more matching rows in the Badges table.
It's true that there is some cost to doing a join. It's a little surprising that it's so dramatically expensive. But this can be mitigated if you use indexes.
I know you said you have a query that can't use an index, but perhaps it's time to consider creating a redundant column with some transformed version of the data of your original column, so you can index it. In the example above, I might create a column creationdate_day and populate it from DAYOFYEAR(creationdate).
Here's what I mean:
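A rough sketch of that idea, assuming the Badges table has a creationdate column as described above. The index name idx_badges_creationdate_day is made up, and 171 is just an arbitrary day number.

```sql
-- Sketch only: assumes Badges.creationdate exists; the index name is hypothetical.
-- Add the redundant column and fill it from the original data.
ALTER TABLE Badges ADD COLUMN creationdate_day SMALLINT UNSIGNED;

UPDATE Badges SET creationdate_day = DAYOFYEAR(creationdate);

-- Index the derived value so equality lookups no longer need a table scan.
CREATE INDEX idx_badges_creationdate_day ON Badges (creationdate_day);

-- Query the indexed column instead of pattern-matching the original one.
SELECT COUNT(*)
FROM Badges
WHERE creationdate_day = 171;
```

The redundant column has to be kept in sync whenever creationdate changes, for example in application code or with a trigger.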
Here's the explain report:
The problem though is with the '%word%'. This will always require a scan of message. You might want to look into full text search if you are using MyISAM.
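For reference, a full-text version of the query from the question might look roughly like this. It is only a sketch: ft_message is a made-up index name, and the tables are the emails and messages tables from the question.

```sql
-- Sketch only: ft_message is a hypothetical index name.
-- Older MySQL versions support FULLTEXT only on MyISAM tables;
-- InnoDB supports it from MySQL 5.6 onwards.
ALTER TABLE messages ADD FULLTEXT INDEX ft_message (message);

SELECT COUNT(*)
FROM emails AS e
JOIN messages AS m ON m.emailid = e.id
WHERE e.ownership = 32
  AND MATCH(m.message) AGAINST ('word');
```

This is not a drop-in replacement for LIKE '%word%': full-text search matches whole words rather than substrings, and by default it ignores stopwords and very short words.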
I think this is what you are looking for:
Hard to tell for sure how it will perform. If the full table scan (FTS) is caused by the leading wildcard on the search word, then doing it this way won't solve the problem. But the good news is that the join may limit the records in the messages table you have to look at.
Is it possible for you to turn the join the other way around? It seems that the second query is the less expensive one, and since the whole thing is a simple join, you want to perform the less expensive query first to narrow the data set as much as possible and then join to your more expensive query.
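One way to experiment with the join order in MySQL is STRAIGHT_JOIN, which makes the optimizer read the left-hand table before the right-hand one. A sketch using the ticket tables from the question, here starting from ticket_subject:

```sql
-- Sketch only: forces ticket_subject to be read first, then looks up
-- each matching ticket by its primary key and checks the category.
SELECT COUNT(*)
FROM ticket_subject
STRAIGHT_JOIN ticket ON ticket.id = ticket_subject.ticketid
WHERE ticket_subject.subject LIKE '%about%'
  AND ticket.category = 1;
```

Swapping the two table names forces the opposite order. Which one is faster depends on the data, so it is worth comparing the EXPLAIN output and timings of both.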