Optimization of MySQL search using "like" and wildcards
How can queries like
SELECT * FROM sometable WHERE somefield LIKE '%value%'
be optimized?
The main issue here is the leading wildcard, which prevents the DBMS from using an index.
Edit: What is more, the somefield value is a solid string (not a piece of text), so full-text search cannot be performed.
How long are your strings?
If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:
For each word you want to store, store every suffix of that word instead. For example, the word
value
gives:
value
alue
lue
ue
e
You can now search for substrings using
LIKE 'alu%'
(which will find 'alu' as part of 'value').
By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
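The suffix approach described above can be sketched as follows. This is only an illustration, not the answerer's code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name word_suffix, the index name idx_word_suffix, and the helper functions are all hypothetical. Whether the prefix LIKE actually uses the index depends on the database and its collation settings; the sketch only shows that the results match a substring search.

```python
import sqlite3

def suffixes(word):
    """Return every suffix of word, longest first ('value' -> value, alue, lue, ue, e)."""
    return [word[i:] for i in range(len(word))]

def build_suffix_table(conn, words):
    """Create a (hypothetical) word_suffix table holding one row per suffix of each word."""
    conn.execute("CREATE TABLE word_suffix (suffix TEXT NOT NULL, word TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_word_suffix ON word_suffix (suffix)")
    rows = [(s, w) for w in words for s in suffixes(w)]
    conn.executemany("INSERT INTO word_suffix (suffix, word) VALUES (?, ?)", rows)

def find_containing(conn, fragment):
    """A prefix match on the suffix column is equivalent to a substring match on the word.

    Note: '%' and '_' inside fragment are not escaped here and would act as wildcards.
    """
    cur = conn.execute(
        "SELECT DISTINCT word FROM word_suffix WHERE suffix LIKE ? ORDER BY word",
        (fragment + "%",),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
build_suffix_table(conn, ["value", "valve", "evaluate"])
print(find_containing(conn, "alu"))
```

The query pattern has no leading wildcard, which is the property that lets a B-tree index on the suffix column be considered at all.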
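Storage Cost
The number of characters required to store a word becomes roughly word_len*word_len / 2, i.e. quadratic in the word length, on a per-word basis. (The exact sum of the suffix lengths is word_len*(word_len+1) / 2, so the factors below slightly understate the cost.) Here is the approximate factor of increase for various word sizes:
(3*3/2) / 3 = 1.5
(5*5/2) / 5 = 2.5
(7*7/2) / 7 = 3.5
(12*12/2) / 12 = 6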
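The storage figures can be checked with a few lines of Python. The function name suffix_storage is made up for this sketch. It reports both the word_len*word_len / 2 estimate used in the factors above and the exact character count obtained by summing the suffix lengths, word_len*(word_len+1) / 2, along with the number of rows (one per suffix).

```python
def suffix_storage(word_len):
    """Return (exact_chars, approx_chars, rows) for storing all suffixes of one word."""
    exact_chars = word_len * (word_len + 1) // 2   # 1 + 2 + ... + word_len
    approx_chars = word_len * word_len / 2         # the quadratic estimate quoted above
    rows = word_len                                # one row per suffix
    return exact_chars, approx_chars, rows

for n in (3, 5, 7, 12):
    exact, approx, rows = suffix_storage(n)
    print(f"len={n:2d} rows={rows:2d} "
          f"approx_factor={approx / n:.1f} exact_factor={exact / n:.1f}")
```

Both figures grow quadratically, so the conclusion is the same either way: the overhead is modest for short words and grows quickly for long strings.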
The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.
Considerations
There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?
The trade-off is as follows:
If we store five and letter separately, searching for five-letter or fiveletter will fail.
If we store it as a single word, it takes more storage space, because the storage requirement increases quadratically in the word length.
For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching five, letter, and fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)
Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.
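The hyphen handling discussed above might look like this in Python. It is a sketch under assumptions: normalize, build_index, and search are hypothetical names, lowercasing is an extra normalization step not mentioned in the answer, and a plain dictionary scan stands in for the indexed LIKE 'prefix%' lookup a database would perform.

```python
def normalize(text):
    """Strip hyphens and lowercase; applied to stored words and to queries alike."""
    return text.replace("-", "").lower()

def suffixes(word):
    """Return every suffix of word, longest first."""
    return [word[i:] for i in range(len(word))]

def build_index(words):
    """Map each suffix of the normalized word back to the original spelling."""
    index = {}
    for word in words:
        for suffix in suffixes(normalize(word)):
            index.setdefault(suffix, set()).add(word)
    return index

def search(index, query):
    """Prefix match against stored suffixes (linear scan in place of an index lookup)."""
    needle = normalize(query)
    hits = set()
    for suffix, words in index.items():
        if suffix.startswith(needle):
            hits |= words
    return sorted(hits)

index = build_index(["five-letter", "value"])
for query in ("five", "letter", "fiveletter", "five-letter"):
    print(query, "->", search(index, query))
```

Because the same normalization is applied on both sides, all four queries reach the stored word, at the cost of storing the longer combined string.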
Two ways:
(1) Use an in-memory table so it goes very fast.
(2) Cook up a better index and search algorithm than foo LIKE '%bar%'. It's not possible to make any suggestions about this without knowing more about your problem.
As you have pointed out, the %bar% pattern guarantees a table scan for every lookup, which nullifies any possible search ingenuity in the database software.
Use Full Text Search. The "Initial Idea" heading has the same example and leads to a worked example solution.
And the MySQL docs
Edit: It can't be tuned in SQL itself. Using functions like LOCATE or PATINDEX won't help either.
It won't make a huge difference, given that your problem is with the wildcard, but not using "SELECT *" will improve query performance. If you're not actually using all the fields you get back, that's a win. "SELECT *" also causes two queries to fire: one to look up the fields for the table, and then your query with the field names added in.