How can I manipulate MySQL fulltext search relevance to make one field more "valuable" than another?

Posted on 2024-07-13 19:43:11

Suppose I have two columns, keywords and content. I have a fulltext index across both. I want a row with foo in the keywords to have more relevance than a row with foo in the content. What do I need to do to cause MySQL to weight the matches in keywords higher than those in content?

I'm using the "match against" syntax.

SOLUTION:

I was able to make this work in the following manner:

SELECT *, 
CASE when Keywords like '%watermelon%' then 1 else 0 END as keywordmatch, 
CASE when Content like '%watermelon%' then 1 else 0 END as contentmatch,
MATCH (Title, Keywords, Content) AGAINST ('watermelon') AS relevance 
FROM about_data  
WHERE MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE) 
HAVING relevance > 0  
ORDER by keywordmatch desc, contentmatch desc, relevance desc

听，心雨的声音 2024-07-20 19:43:12

Create three full text indexes

a) one on the keyword column
b) one on the content column
c) one on both keyword and content column
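
As a rough illustration, the three indexes could be created along these lines. This is a minimal sketch, not the answerer's actual schema: the table name articles, the id column, the column types, and the index names are all assumptions. The indexes are created one statement at a time because InnoDB may refuse to build several FULLTEXT indexes in a single ALTER TABLE.

```sql
-- Hypothetical table: only the keyword and content columns come from the answer.
CREATE TABLE articles (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword VARCHAR(255) NOT NULL,
  content TEXT NOT NULL
) ENGINE=InnoDB;

-- a) scores the keyword column on its own
CREATE FULLTEXT INDEX ft_keyword ON articles (keyword);

-- b) scores the content column on its own
CREATE FULLTEXT INDEX ft_content ON articles (content);

-- c) combined index, used by the WHERE clause to decide which rows come back
CREATE FULLTEXT INDEX ft_keyword_content ON articles (keyword, content);
```

The column list in each MATCH() has to line up with the column list of one of these indexes, which is why the combined search needs index c) in addition to a) and b).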

Then, your query:

SELECT id, keyword, content,
  MATCH (keyword) AGAINST ('watermelon') AS rel1,
  MATCH (content) AGAINST ('watermelon') AS rel2
FROM `table`  -- table is a reserved word in MySQL, so it has to be quoted
WHERE MATCH (keyword,content) AGAINST ('watermelon')
ORDER BY (rel1*1.5)+(rel2) DESC

The point is that rel1 gives you the relevance of your query in the keyword column alone (because that index covers only that column). rel2 does the same for the content column. You can now add the two scores together, applying whatever weighting you like.

However, you aren't using either of these two indexes for the actual search. For that, you use your third index, which covers both columns.

The index on (keyword,content) controls your recall, that is, which rows are returned.

The two single-column indexes (one on keyword only, one on content only) control your relevance, and this is where you apply your own weighting criteria.

Note that you can use any number of different indexes, or vary the indexes and weightings you use at query time based on other factors: for example, search only on keyword if the query contains a stop word, or decrease the weighting bias for keywords if the query contains more than 3 words.
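
One way to vary the weighting at query time is sketched below. It assumes a hypothetical articles table that has the three FULLTEXT indexes described above, and that your MySQL version accepts a user variable as the AGAINST argument. The variable names and the weights 1.5 and 1.2 are illustrative choices, not values from the answer.

```sql
-- Search text, held in a user variable (hypothetical example value).
SET @q = 'seedless watermelon salad recipe';

-- Rough word count: number of single spaces plus one.
SET @words = LENGTH(@q) - LENGTH(REPLACE(@q, ' ', '')) + 1;

-- Lower the keyword bias when the query has more than 3 words.
SET @kw_weight = IF(@words > 3, 1.2, 1.5);

SELECT id, keyword, content,
  MATCH (keyword) AGAINST (@q) AS rel1,
  MATCH (content) AGAINST (@q) AS rel2
FROM articles
WHERE MATCH (keyword, content) AGAINST (@q)
ORDER BY (rel1 * @kw_weight) + rel2 DESC;
```

The same effect can be had by computing the weight in application code and passing it in as a bound parameter.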

Each index uses disk space, so more indexes mean more disk and, in turn, a higher memory footprint for MySQL. Inserts also take longer, because there are more indexes to update.

You should benchmark performance for your situation (being careful to turn off the MySQL query cache while benchmarking, on versions that still have one, or your results will be skewed). This isn't Google-grade efficiency, but it is pretty easy, works "out of the box", and is almost certainly a lot better than your use of "like" in the queries.

I find it works really well.

时光清浅 2024-07-20 19:43:12

Actually, using a case statement to make a pair of flags might be a better solution:

select 
...
-- MySQL concatenates strings with concat(); '+' would be numeric addition
, case when keyword like concat('%', @input, '%') then 1 else 0 end as keywordmatch
, case when content like concat('%', @input, '%') then 1 else 0 end as contentmatch
-- or whatever check you use for the matching
from 
   ... 
   -- and here the rest of your usual matching query
   ... 
order by keywordmatch desc, contentmatch desc

This only fits if every keyword match should rank higher than every content-only match. I also made the assumption that a match in both keyword and content is the highest rank.

水水月牙 2024-07-20 19:43:12

Simpler version using only 2 fulltext indexes (credit to @mintywalker):

SELECT id, 
   MATCH (`content_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance1,  
   MATCH (`title_ft`) AGAINST ('keyword*' IN BOOLEAN MODE) AS relevance2
FROM search_table
HAVING (relevance1 + relevance2) > 0
ORDER BY (relevance1 * 1.5) + (relevance2) DESC
LIMIT 0, 1000;

This searches both fulltext-indexed columns for the keyword and returns the relevance of each match in two separate columns. Rows with no match (relevance1 and relevance2 both zero) are excluded, and the results are ordered with extra weight on the content_ft column. No composite fulltext index is needed.

泪痕残 2024-07-20 19:43:12

In Boolean mode, MySQL supports the > and < operators to change a word's contribution to the relevance value that is assigned to a row.

I wonder if something like this would work?

SELECT *, 
MATCH (Keywords) AGAINST ('>watermelon' IN BOOLEAN MODE) AS relStrong, 
MATCH (Title,Keywords,Content) AGAINST ('<watermelon' IN BOOLEAN MODE) AS relWeak 
FROM about_data  
WHERE MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE) 
ORDER by (relStrong+relWeak) desc
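
For comparison, > and < are documented as raising or lowering a word's contribution relative to the other words in the same search expression, so they act on words rather than on columns. A small sketch of that documented usage follows. It reuses the about_data table from the question; the extra search words seedless and pickled are made up for illustration.

```sql
-- Rows must contain "watermelon" plus at least one of the two other words.
-- A row with "seedless" ranks higher than a row with "pickled".
SELECT *,
  MATCH (Title, Keywords, Content)
    AGAINST ('+watermelon +(>seedless <pickled)' IN BOOLEAN MODE) AS relevance
FROM about_data
WHERE MATCH (Title, Keywords, Content)
    AGAINST ('+watermelon +(>seedless <pickled)' IN BOOLEAN MODE)
ORDER BY relevance DESC;
```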

萌辣 2024-07-20 19:43:12

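Well, that depends on what exactly you mean by:

"I want a row with foo in the keywords to have more relevance than a row with foo in the content."

If you mean that a row with foo in the keywords should come before any row with foo in the content, then I would run two separate queries: one on the keywords and then (possibly lazily, only if it is requested) another one on the content.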
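
A minimal sketch of the two-query approach, under some assumptions: a hypothetical articles table with separate FULLTEXT indexes on keyword and on content, and the search word watermelon standing in for foo (a three-letter word can fall below the minimum indexed word length on some configurations).

```sql
-- Query 1: keyword matches, shown first.
SELECT id, keyword, content,
  MATCH (keyword) AGAINST ('watermelon') AS relevance
FROM articles
WHERE MATCH (keyword) AGAINST ('watermelon')
ORDER BY relevance DESC;

-- Query 2: run only when more results are requested.
-- The last condition skips rows that query 1 already returned.
SELECT id, keyword, content,
  MATCH (content) AGAINST ('watermelon') AS relevance
FROM articles
WHERE MATCH (content) AGAINST ('watermelon')
  AND MATCH (keyword) AGAINST ('watermelon') = 0
ORDER BY relevance DESC;
```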

高跟鞋的旋律 2024-07-20 19:43:12

I did this a few years ago, but without the full text index. I don't have the code handy (former employer), but I remember the technique well.

In a nutshell, I selected a "weight" from each column. For example:

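-- "keyword match" and "content match" stand for your own match conditions
-- coalesce() turns the NULL from an unmatched left join into 0
select t.id, coalesce(a.keyword_relevance, 0) + coalesce(b.content_relevance, 0) as relevance
from table_name t
   left join
      (select id, 1 as keyword_relevance from table_name where keyword match) a
   on t.id = a.id
   left join
      (select id, 0.75 as content_relevance from table_name where content match) b
   on t.id = b.id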

Please forgive any shoddy SQL here; it's been a few years since I needed to write any, and I'm doing this off the top of my head...

Hope this helps!

J.Js

伴我心暖 2024-07-20 19:43:12

I needed something similar and used the OP's solution, but I noticed that fulltext doesn't match partial words. So if 'watermelon' is in Keywords or Content as part of a word (like watermelonsalesmanager) it doesn't MATCH and is not included in the results because of the WHERE MATCH.
So I fooled around a bit and tweaked the OP's query to this:

SELECT *, 
CASE WHEN Keywords LIKE '%watermelon%' THEN 1 ELSE 0 END AS keywordmatch, 
CASE WHEN Content LIKE '%watermelon%' THEN 1 ELSE 0 END AS contentmatch,
MATCH (Title, Keywords, Content) AGAINST ('watermelon') AS relevance 
FROM about_data  
WHERE (Keywords LIKE '%watermelon%' OR 
  Content LIKE '%watermelon%' OR 
  Title LIKE '%watermelon%' OR 
  MATCH(Title, Keywords, Content) AGAINST ('watermelon' IN BOOLEAN MODE)) 
HAVING (keywordmatch > 0 OR contentmatch > 0 OR relevance > 0)  
ORDER BY keywordmatch DESC, contentmatch DESC, relevance DESC

Hope this helps.

百思不得你姐 2024-07-20 19:43:12

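As far as I know, this isn't supported by MySQL fulltext search, but you can achieve the effect by repeating the word several times in the keyword field.
Instead of having the keywords "foo bar", have "foo bar foo bar foo bar". That way foo and bar are still equally important within the keywords column, and because they appear several times they become more relevant to MySQL.

We use this on our site and it works.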
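
The repetition could be kept out of the visible data by storing it in a separate column, roughly as follows. This is only a sketch: it borrows the about_data table from the question, and the keywords_boosted column, the index name, and the repeat count of three are all assumptions.

```sql
-- Hypothetical extra column that holds the keywords three times over,
-- for example 'foo bar foo bar foo bar'.
ALTER TABLE about_data ADD COLUMN keywords_boosted TEXT;

UPDATE about_data
SET keywords_boosted = CONCAT_WS(' ', Keywords, Keywords, Keywords);

-- Index and search the boosted copy together with the content.
CREATE FULLTEXT INDEX ft_boosted_content
  ON about_data (keywords_boosted, Content);

SELECT *,
  MATCH (keywords_boosted, Content) AGAINST ('watermelon') AS relevance
FROM about_data
WHERE MATCH (keywords_boosted, Content) AGAINST ('watermelon')
ORDER BY relevance DESC;
```

The boosted column has to be refreshed whenever Keywords changes, for example from application code or a trigger.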

踏月而来 2024-07-20 19:43:12

If the metric is just that all the keyword matches are more "valuable" than all the content matches then you can just use a union with row counts. Something along these lines.

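select *
from (
   -- "keyword match" and "content match" stand for your own match conditions
   select row_number() over(order by blahblah) as rn, t.*
   from thetable t
   where keyword match

   union

   -- @@rowcount is SQL Server syntax, meant here as "number of keyword rows"
   select row_number() over(order by blahblah) + @@rowcount + 1 as rn, t.*
   from thetable t
   where content match
) ranked
order by rn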

For anything more complicated than that, where you want to apply an actual weight to every row, I don't know how to help.
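
In MySQL the same "all keyword matches first" ordering can be sketched without row numbers. Note the swap: instead of row_number() plus a row count, this version tags each half of the union with a constant tier column and sorts on that. The articles table, its separate FULLTEXT indexes on keyword and content, and the search word are assumptions for illustration.

```sql
-- Tier 1: rows that match in the keyword column.
(SELECT id, keyword, content, 1 AS tier,
   MATCH (keyword) AGAINST ('watermelon') AS score
 FROM articles
 WHERE MATCH (keyword) AGAINST ('watermelon'))

UNION ALL

-- Tier 2: rows that match in content only.
(SELECT id, keyword, content, 2 AS tier,
   MATCH (content) AGAINST ('watermelon') AS score
 FROM articles
 WHERE MATCH (content) AGAINST ('watermelon')
   AND MATCH (keyword) AGAINST ('watermelon') = 0)

-- Applies to the whole union: tier 1 before tier 2, best score first.
ORDER BY tier, score DESC;
```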
