Is there any point in creating a second column optimized for FULLTEXT searches?

Posted on 2024-09-14 05:46:52

The project I'm working on has, for each column that needs to be searched, a second column called "ft[columnname]". This second column has a FULLTEXT index, and it is the only one searched against.

This column contains an "optimized" text, that is automatically generated from the original column in the following way:

  • The string is lowercased
  • All accents are removed
  • All punctuation and unsearchable characters are removed
  • All duplicate words are removed
  • All words are sorted from the longest to the shortest
  • Other transformations that I don't really understand (related to combined-words)

For example "I like Pokémon, especially Pikachu!" becomes "especially pokemon pikachu like i".
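
To make the steps listed above concrete, here is a minimal Python sketch of such a normalization. It is only a guess at what the generator does: the function name `build_ft_text` and the token rule `[a-z0-9]+` are assumptions, and the combined-word transformations are left out because they are not described here.

```python
import re
import unicodedata


def build_ft_text(text: str) -> str:
    """Hypothetical reconstruction of the "ft" column generator."""
    # 1. Lowercase the string.
    lowered = text.lower()
    # 2. Remove accents: decompose characters, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 3. Drop punctuation by keeping only runs of ASCII letters and digits
    #    (assumed definition of a "searchable" character).
    words = re.findall(r"[a-z0-9]+", no_accents)
    # 4. Remove duplicate words while keeping first-seen order.
    unique = list(dict.fromkeys(words))
    # 5. Sort from longest to shortest; the sort is stable, so words of
    #    equal length keep their original relative order.
    unique.sort(key=len, reverse=True)
    return " ".join(unique)


print(build_ft_text("I like Pokémon, especially Pikachu!"))
```

With the example sentence this prints "especially pokemon pikachu like i", which matches the value stored in the ft column.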

Is there any performance benefit, even a very tiny one? The data in the database never changes dynamically.

Comments (2)

分開簡單 2024-09-21 05:46:52

There might be a functional benefit for your specific application, but storing the data in duplicate is largely a performance hit, not a benefit.

Since your data is now twice as big, assuming a sufficiently large data set, only half as much of it can be held in the various levels of caching (e.g. MySQL, OS), so you're going to be reading from disk much more often, and disk is the usual bottleneck.

Having said that, if you use a single-byte character set on the ft indexed column but a multi-byte character set on the original text, your full-text index may be much smaller than it would have been otherwise.
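
As a rough illustration of the character-set point, the Python sketch below compares the encoded size of an accented word in UTF-8 with its accent-stripped form in a single-byte encoding. The helper `encoded_size` is hypothetical, and this only counts the bytes of the strings themselves. How much space MySQL actually uses depends on the storage engine, row format, and column type, none of which are modelled here.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


original = "Pokémon"    # contains one accented character
normalized = "pokemon"  # lowercased, accent removed

# In UTF-8 the accented "é" takes two bytes; every other letter takes one.
print(encoded_size(original, "utf-8"))
# In a single-byte character set such as latin-1, each character is one byte.
print(encoded_size(normalized, "latin-1"))
```

For text that is mostly plain ASCII the difference per word is small, so any saving would come mainly from text with many multi-byte characters.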

ˉ厌 2024-09-21 05:46:52

Honestly, you should not do this in a second column, because a FULLTEXT index there implies you are using the MyISAM storage engine for a production table (the only engine that supported FULLTEXT indexes before MySQL 5.6). Or go ahead if you can afford to lose some data.

In fact, since you do care about performance, you should consider using a capable full-text search engine such as Sphinx: http://www.sphinxsearch.com/
