Optimizing MySQL search with LIKE and wildcards

Published 2024-08-18 21:34:18


How can queries like

SELECT * FROM sometable WHERE somefield LIKE '%value%'

be optimized?

The main issue here is the leading wildcard, which prevents the DBMS from using an index.

Edit: What is more, the somefield value is a solid string (not a piece of text), so full-text search cannot be performed.


狠疯拽 2024-08-25 21:34:18


How long are your strings?

If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:

  • For each word that you want to store in the table, instead take every possible suffix of that word. In other words, you keep stripping the first character until nothing is left. For example, the word value gives:
    • value
    • alue
    • lue
    • ue
    • e
  • Store each of these suffixes in the database.
  • You can now search for substrings using LIKE 'alu%' (which will find 'alu' as part of 'value').

By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
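The suffix-expansion step described above can be sketched in a few lines of Python (the table and column names in the comment are illustrative, not from the answer):

```python
def suffixes(word: str) -> list[str]:
    """Return every suffix of word, from the full word down to its last character."""
    return [word[i:] for i in range(len(word))]

# Each suffix would be stored as its own row, e.g. in a table such as
#   CREATE TABLE word_suffix (suffix VARCHAR(64), word_id INT, KEY (suffix));
# so that a query like  WHERE suffix LIKE 'alu%'  can use the index on `suffix`.
print(suffixes("value"))  # ['value', 'alue', 'lue', 'ue', 'e']
```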

Storage Cost

The number of characters required to store a word becomes roughly word_len*word_len / 2 (exactly word_len*(word_len+1)/2), i.e. quadratic in the word length, on a per-word basis. Here is the approximate factor of increase for various word sizes:

  • 3-letter word: (3*3/2) / 3 = 1.5
  • 5-letter word: (5*5/2) / 5 = 2.5
  • 7-letter word: (7*7/2) / 7 = 3.5
  • 12-letter word: (12*12/2) / 12 = 6
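The factors above can be checked directly. Note that the exact character total for a word of length n is n*(n+1)/2 (the sum 1 + 2 + … + n over all suffix lengths), which the answer rounds down to n*n/2:

```python
def storage_factor(n: int) -> float:
    """Exact blow-up factor: total characters across all suffixes / word length."""
    total = sum(range(1, n + 1))  # 1 + 2 + ... + n == n*(n+1)/2
    return total / n              # simplifies to (n + 1) / 2

for n in (3, 5, 7, 12):
    approx = (n * n / 2) / n  # the answer's approximation, i.e. n/2
    print(f"{n}-letter word: exact {storage_factor(n)}, approx {approx}")
```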

The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.

Considerations

There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?

The trade-off is as follows:

  • Anything that is broken up cannot be found as a single element. If we store five and letter separately, searching for five-letter or fiveletter will fail.
  • Anything that is not broken up will take more storage space. Remember, the storage
    requirement increases quadratically in the word length.

For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching five, letter, and fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)
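The convention above only works if the same normalization is applied on both the storage side and the query side. A minimal sketch, assuming hyphen removal as described (the helper name is illustrative):

```python
def normalize(term: str) -> str:
    """Strip hyphens and lowercase, so 'Five-Letter' is stored and searched as 'fiveletter'."""
    return term.replace("-", "").lower()

# Applied at indexing time AND to incoming search queries, so a user
# typing 'five-letter' still matches the stored 'fiveletter'.
print(normalize("Five-Letter"))  # fiveletter
```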

Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.

凉栀 2024-08-25 21:34:18


Two ways:

(1) use an in-memory table so it goes very fast.

(2) cook up a better index and search algorithm than foo LIKE '%bar%'. It's not possible to make any suggestions about this without knowing more about your problem.

As you have pointed out, the %bar% pattern guarantees a table-scan for every lookup, which nullifies any possible search ingenuity in the database software.

捎一片雪花 2024-08-25 21:34:18


Use Full Text Search. The "Initial Idea" heading has the same example and leads to a worked example solution.

And the MySQL docs

Edit: It can't be tuned in SQL itself. Using functions like LOCATE or PATINDEX won't help either.

蓝咒 2024-08-25 21:34:18


It won't make a huge difference, given your problem is with the wildcard, but not using "SELECT *" will improve query performance. If you're not actually using all the fields you get back, that's a win; "SELECT *" also causes two queries to fire, one to look up the table's fields and then your query with the field names filled in.
