Optimization of MySQL search using "like" and wildcards

Posted on 2024-08-18 21:34:18

How can queries like

SELECT * FROM sometable WHERE somefield LIKE '%value%'

be optimized?

The main issue here is the leading wildcard, which prevents the DBMS from using an index.

Edit: What is more, each somefield value is a single solid string (not a piece of text), so full-text search cannot be used.

狠疯拽 2024-08-25 21:34:18

How long are your strings?

If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:

For each word that you want to store in the table, store every possible suffix of that word instead of only the word itself. In other words, you keep stripping the first character until nothing is left. For example, the word value gives:
- value
- alue
- lue
- ue
- e
Store each of these suffixes in the database.
You can now search for substrings using LIKE 'alu%' (which will find 'alu' as part of 'value').

By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
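The approach above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database as a stand-in for MySQL, and the table name word_suffix, its columns, and the helper functions are all hypothetical. In MySQL the same idea would rely on an ordinary index on the suffix column, which can serve a LIKE pattern that has only a trailing wildcard.

```python
import sqlite3


def suffixes(word):
    """Return every non-empty suffix of word, longest first."""
    return [word[i:] for i in range(len(word))]


def build_index(conn, words):
    """Create a hypothetical word_suffix table and fill it with all suffixes."""
    conn.execute(
        "CREATE TABLE word_suffix (suffix TEXT NOT NULL, word TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_suffix ON word_suffix (suffix)")
    rows = [(s, w) for w in words for s in suffixes(w)]
    conn.executemany(
        "INSERT INTO word_suffix (suffix, word) VALUES (?, ?)", rows
    )


def find_containing(conn, fragment):
    """Find words containing fragment using only a trailing wildcard."""
    # Escape LIKE metacharacters so the fragment is matched literally.
    escaped = (
        fragment.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    cur = conn.execute(
        "SELECT DISTINCT word FROM word_suffix "
        "WHERE suffix LIKE ? ESCAPE '\\' ORDER BY word",
        (escaped + "%",),
    )
    return [row[0] for row in cur]


# SQLite stands in for MySQL here (an assumption made for this sketch).
conn = sqlite3.connect(":memory:")
build_index(conn, ["value", "evaluate", "blue"])
print(suffixes("value"))
print(find_containing(conn, "alu"))
```

The substring search '%alu%' on the original column becomes the prefix search 'alu%' on the suffix column, at the price of one row per suffix.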

Storage Cost

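The number of characters required to store a word becomes roughly word_len*word_len / 2, i.e. quadratic in the word length, on a per-word basis. (The exact count is word_len*(word_len+1) / 2, since the suffix lengths are 1, 2, ..., word_len, so the figures below slightly underestimate.) Here is the approximate factor of increase for various word sizes:

3-letter word: (3*3/2) / 3 = 1.5
5-letter word: (5*5/2) / 5 = 2.5
7-letter word: (7*7/2) / 7 = 3.5
12-letter word: (12*12/2) / 12 = 6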
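To check these figures, here is a small sketch that compares the estimate word_len*word_len / 2 with the exact character count obtained by summing the suffix lengths 1 + 2 + ... + word_len. The function name suffix_storage is made up for illustration.

```python
def suffix_storage(word_len):
    """Return (exact_chars, approx_chars, exact_factor) for one word."""
    # Exact total: the suffixes have lengths 1, 2, ..., word_len.
    exact = word_len * (word_len + 1) // 2
    # The estimate used in the text.
    approx = word_len * word_len / 2
    # Factor of increase relative to storing the word once.
    return exact, approx, exact / word_len


for n in (3, 5, 7, 12):
    exact, approx, factor = suffix_storage(n)
    print(f"{n}-letter word: {exact} chars exact, {approx} estimated, "
          f"exact factor {factor}")
```

The exact factor works out to (word_len + 1) / 2, which is half a unit above the estimate for every word length, so the growth is still quadratic in total characters.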

The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead, and keep additional columns to a minimum to avoid storing large amounts of redundant data. For instance, the page number on which the word was originally found should be fine (think SMALLINT UNSIGNED), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than repeated for each suffix.
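One possible layout for this split is sketched below, again with Python and SQLite standing in for MySQL. Every table name, column name, and helper here is an assumption made for illustration: the narrow word_suffix table carries only the suffix, a small page number, and a reference to the word, while the bulky metadata sits once per word in a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per word: extensive metadata is stored here, once.
    CREATE TABLE word (
        id INTEGER PRIMARY KEY,
        word_text TEXT NOT NULL,
        notes TEXT
    );
    -- One row per suffix: kept as narrow as possible.
    CREATE TABLE word_suffix (
        suffix TEXT NOT NULL,
        page INTEGER NOT NULL,
        word_id INTEGER NOT NULL REFERENCES word(id)
    );
    CREATE INDEX idx_word_suffix ON word_suffix (suffix);
""")


def add_word(word_text, page, notes=None):
    """Insert one metadata row plus one narrow row per suffix."""
    cur = conn.execute(
        "INSERT INTO word (word_text, notes) VALUES (?, ?)",
        (word_text, notes),
    )
    word_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO word_suffix (suffix, page, word_id) VALUES (?, ?, ?)",
        [(word_text[i:], page, word_id) for i in range(len(word_text))],
    )
    return word_id


def search(fragment):
    """Return (word_text, page) for each word containing fragment.

    Note: % and _ inside fragment are not escaped in this sketch.
    """
    cur = conn.execute(
        "SELECT DISTINCT w.word_text, s.page FROM word_suffix s "
        "JOIN word w ON w.id = s.word_id "
        "WHERE s.suffix LIKE ? ORDER BY w.word_text",
        (fragment + "%",),
    )
    return cur.fetchall()


add_word("value", 12, "a long description that is stored only once")
add_word("blue", 40)
print(search("lue"))
```

The join back to the word table happens only for matching rows, so the per-suffix rows stay small no matter how much metadata a word carries.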

Considerations

There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?

The trade-off is as follows:

- Anything that is broken up cannot be found as a single element. If we store five and letter separately, searching for five-letter or fiveletter will fail.
- Anything that is not broken up will take more storage space. Remember, the storage requirement increases quadratically in the word length.

For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching five, letter, and fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)
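The hyphen handling described above could look something like this minimal sketch. The helper names normalize and matches are hypothetical, and a plain Python list stands in for the indexed suffix column; the point is only that the same normalization is applied to stored words and to search queries.

```python
def normalize(term):
    """Strip hyphens; applied to stored words and to queries alike."""
    return term.replace("-", "")


def suffixes(word):
    """Return every non-empty suffix of word, longest first."""
    return [word[i:] for i in range(len(word))]


# Store the suffixes of the normalized word, i.e. of "fiveletter".
stored = suffixes(normalize("five-letter"))


def matches(query):
    """True if the normalized query is a prefix of any stored suffix."""
    q = normalize(query)
    return any(s.startswith(q) for s in stored)


for query in ("five", "letter", "fiveletter", "five-letter"):
    print(query, matches(query))
```

All four queries match, because "five-letter" is reduced to "fiveletter" before the lookup, and "five" and "letter" are each a prefix of some stored suffix.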

Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.
