How is full-text search relevance measured?

Posted on 2024-07-08 02:45:07

I am making a quiz system, and when quizmakers insert questions into the Question Bank, I am to check the DB for duplicate / very highly similar questions.

Testing MySQL's MATCH() ... AGAINST(), the highest relevance I get is 30+, and that is when I test against a 100% identical string.

So what exactly is the relevance? To quote the manual:

Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

My problem is how to use the relevance value to test whether a string is a duplicate. If it's a 100% duplicate, I want to prevent it from being inserted into the Question Bank. But if it is only very similar, I want to prompt the quizmaker to verify it and decide whether to insert it or not. So how do I do that? 30+ for a 100% identical string is not a percentage, so I'm stumped.

Thanks in advance.

Comments (3)

橘虞初梦 2024-07-15 02:45:07

The basic data structure for a text retrieval system is an Inverted Index. This is essentially a list of words found in the document collection with a list of the documents they occur in. It can also have metadata about the occurrence for each document, such as the number of times the word appears.
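To make that concrete, here is a minimal sketch of an inverted index in Python. The function name `build_inverted_index` and the sample questions are made up for illustration, and splitting on whitespace is a simplifying assumption; a real engine tokenizes far more carefully.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {doc_id: number of times the word occurs in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


# Hypothetical question bank, keyed by row id.
docs = {
    1: "what is the capital of France",
    2: "what is the capital of Spain",
    3: "name the largest planet",
}
index = build_inverted_index(docs)
```

Looking up `index["capital"]` then gives the documents that contain the word, together with the per-document counts mentioned above.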

Documents containing the words can be queried by matching on the search terms. To determine relevance, a heuristic known as Cosine Ranking is calculated on the hits. This works by constructing an n-dimensional vector with one component for each of the n search terms. You can also weight the search terms if desired. This vector gives a point in n-dimensional space that corresponds to your search terms.

A similar vector based on the weighted occurrences in each document can be constructed from the inverted index, with each axis in the vector corresponding to the axis for each search term. If you calculate the dot product of these vectors and divide it by the product of their lengths, you get the cosine of the angle between them. 1.0 is equivalent to cos(0), which means the vectors lie on a common line from the origin. The closer the vectors are together, the smaller the angle and the closer the cosine is to 1.0.
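A rough Python sketch of that cosine calculation follows. It assumes unweighted raw word counts and whitespace tokenization, and `cosine_similarity` is an illustrative name. It is not what MySQL computes internally.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between query and document term-count vectors."""
    q_counts = Counter(query.lower().split())
    d_counts = Counter(document.lower().split())
    # One axis per search term, as described above.
    terms = list(q_counts)
    q_vec = [q_counts[t] for t in terms]
    d_vec = [d_counts[t] for t in terms]
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    # A document sharing no terms with the query has a zero vector.
    return dot / norm if norm else 0.0


same = cosine_similarity("capital of France", "what is the capital of France")
partial = cosine_similarity("capital of France", "capital of Spain")
none = cosine_similarity("capital of France", "name the largest planet")
```

Because the axes cover only the search terms, a document that contains every search term plus extra words still scores about 1.0, so on its own this does not prove two questions are identical.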

If you sort the search results by the cosine (or bung them into a priority queue as mg does) you get the most relevant. Cleverer relevance algorithms tend to fiddle with the weights of the search terms, skewing the dot product in favour of terms with high relevance.
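The ranking step can be sketched with Python's `heapq`, which plays the role of the priority queue. The `scores` dictionary is made-up data standing in for cosines that have already been computed, and `top_matches` is an illustrative name.

```python
import heapq


def top_matches(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical cosine scores keyed by document id.
scores = {1: 0.82, 2: 1.0, 3: 0.0, 4: 0.41}
best = top_matches(scores, 2)
```

Here `best` holds documents 2 and 1, in that order. Only the ordering matters, which is why the raw scores are hard to read as absolute values.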

If you want to dig a little, Managing Gigabytes by Witten, Moffat, and Bell discusses the internal architecture of text retrieval systems.

(り薆情海 2024-07-15 02:45:07

andygeers is on the right track: Those numbers have no empirical meaning other than their relations to each other and cannot be used on their own to determine what is or is not an "exact match". You need to determine that yourself. Even aside from the limitations of fulltext search ranking, there's also the open question of just what you consider to constitute an "exact match". (Actual text only or do soundex matches count? Do synonyms (e.g., "couch" vs. "sofa") count as matching or as distinct? Should an attempt be made to compensate for misspellings? Etc.)

If I had the need to perform such a check, I would grab only the highest-ranked entry returned by the fulltext search, remove any designated stopwords, normalize whitespace, convert to lowercase, do the comparison, and leave it at that until I encountered a case that called for it to be refined further. It's not really all that much extra work - if you specify the language you're using for your application, you could probably find someone around here who could write the normalization function within a dozen or so lines of code.
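Since no language was specified, here is what such a normalization function might look like in Python. The `STOPWORDS` set is a hypothetical stand-in (it is not MySQL's built-in list), and `normalize` and `is_exact_duplicate` are illustrative names.

```python
# Hypothetical stopword list; replace it with whatever list your application uses.
STOPWORDS = {"a", "an", "the", "is", "of"}


def normalize(text):
    """Lowercase, collapse whitespace, and drop stopwords."""
    words = text.lower().split()  # split() also collapses runs of whitespace
    return " ".join(w for w in words if w not in STOPWORDS)


def is_exact_duplicate(new_question, top_hit):
    """Compare a new question against the highest-ranked fulltext hit."""
    return normalize(new_question) == normalize(top_hit)


dup = is_exact_duplicate("What is  the Capital of France", "what is the capital of france")
not_dup = is_exact_duplicate("What is the capital of Spain", "What is the capital of France")
```

The idea would be to run the fulltext query first, pass only the top-ranked row to `is_exact_duplicate`, block the insert when it returns true, and otherwise show the top hits to the quizmaker for review. Punctuation is left untouched in this sketch, so "France?" and "France" would still count as different.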

濫情▎り 2024-07-15 02:45:07

I don't know the specifics of the MySQL function you're using, but I imagine it could be that there is no absolute meaning for those numbers - they're just designed to be compared with other values produced by the same function. To check for an absolute match you could select out the text itself and compare manually.
