Best way to rank sentences based on similarity across a set of documents

Posted 2024-12-24 01:32:43

I want to know the best way to rank sentences by similarity across a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary one, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked such that the FIRST-ranked sentence is the one most similar across all 5 documents, then the 2nd, then the 3rd...

Thanks in advance.


负佳期 2024-12-31 01:32:43

I'll cover the basics of textual document matching...

Most document similarity measures work on a word basis rather than on sentence structure. The first step is usually stemming: words are reduced to their root form, so that different forms of similar words (e.g. "swimming" and "swims") match.

Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are many conjunctions and pronouns that you may wish to omit, so usually you will have a long list of such words - this is called a "stop list".

Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slurs. So you may have another exclusion list with such words in it, a "bad list".
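To make that concrete, here is a minimal preprocessing sketch in Python using NLTK's Porter stemmer and English stop list; the BAD_WORDS set is a hypothetical placeholder for your own exclusion list:

    # pip install nltk; then download the tokenizer data and stop list once:
    #   python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    stemmer = PorterStemmer()
    STOP_WORDS = set(stopwords.words("english"))  # "the", "a", pronouns, ...
    BAD_WORDS = set()  # hypothetical placeholder: add words you want excluded

    def preprocess(text):
        """Tokenize, lowercase, drop stop/bad words, and stem the rest."""
        tokens = word_tokenize(text.lower())
        return [stemmer.stem(t) for t in tokens
                if t.isalpha() and t not in STOP_WORDS and t not in BAD_WORDS]

    print(preprocess("The swimmer swims and was swimming."))
    # -> something like ['swimmer', 'swim', 'swim']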

So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes as input the similar words and gives a value of "similarity". Such a function should give a high value if the same word appears multiple times in both documents. Additionally, such matches are weighted by the total word frequency so that when uncommon words match, they are given more statistical weight.
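A toy version of such a score function might look like the following sketch (my own illustration, not Lucene's actual formula): it counts shared words between two token lists and up-weights words that are rare across the whole document set, an IDF-style weighting:

    import math
    from collections import Counter

    def similarity_score(tokens_a, tokens_b, all_docs):
        """Toy similarity: shared-word counts, weighted so that words
        that are rare across all_docs contribute more to the score."""
        counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
        n_docs = len(all_docs)
        score = 0.0
        for word in counts_a.keys() & counts_b.keys():
            # document frequency: number of documents containing the word
            df = sum(1 for doc in all_docs if word in doc)
            idf = 1.0 + math.log(n_docs / (1.0 + df))  # rare words weigh more
            score += counts_a[word] * counts_b[word] * idf * idf
        return score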

Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how they weight query similarity:

http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html

Lucene combines Boolean model (BM) of Information Retrieval with
Vector Space Model (VSM) of Information Retrieval - documents
"approved" by BM are scored by VSM.

All of this is really just about matching words in documents. You did specify matching sentences. For most people's purposes, matching words is more useful, as you can have a huge variety of sentence structures that really mean the same thing. The most useful similarity information is simply in the words. I've talked about document matching, but for your purposes, a sentence is just a very small document.

Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...

First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
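As a rough sketch of what inexact tree matching can look like, here is a simple top-down tree edit distance in Python. Node is a hypothetical parse-tree type; real parser output (and the full Zhang-Shasha algorithm) would be more involved:

    class Node:
        """Hypothetical parse-tree node, as a parser might produce."""
        def __init__(self, label, children=()):
            self.label = label
            self.children = list(children)

    def size(t):
        return 1 + sum(size(c) for c in t.children)

    def tree_distance(a, b):
        """Top-down edit distance: label substitution cost plus an
        edit-distance alignment of the two child sequences."""
        cost = 0 if a.label == b.label else 1
        ca, cb = a.children, b.children
        m, n = len(ca), len(cb)
        # d[i][j]: cheapest alignment of the first i children of a
        # with the first j children of b (deletion costs subtree size)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + size(ca[i - 1])
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + size(cb[j - 1])
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j] + size(ca[i - 1]),   # delete a child of a
                    d[i][j - 1] + size(cb[j - 1]),   # insert a child of b
                    d[i - 1][j - 1] + tree_distance(ca[i - 1], cb[j - 1]),
                )
        return cost + d[m][n]

    s1 = Node("S", [Node("NP"), Node("VP", [Node("V"), Node("NP")])])
    s2 = Node("S", [Node("NP"), Node("VP", [Node("V")])])
    print(tree_distance(s1, s2))  # 1: one NP subtree deleted

Each pair of nodes is aligned at most once, so the whole computation stays polynomial in the tree sizes.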

动次打次papapa 2024-12-31 01:32:43

As a starting point you can compute the Soundex code for each word and then compare documents based on Soundex code frequencies.
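If it helps, here is a sketch of American Soundex in Python (four-character codes, first letter kept). Note that this is phonetic matching, so it conflates many unrelated words and works best as a coarse first pass:

    def soundex(word):
        """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
        groups = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
        codes = {ch: str(i) for i, g in enumerate(groups, start=1) for ch in g}
        word = "".join(ch for ch in word.lower() if ch.isalpha())
        if not word:
            return ""
        digits, prev = [], codes.get(word[0])
        for ch in word[1:]:
            if ch in "hw":
                continue  # h/w do not separate letters with equal codes
            code = codes.get(ch)  # vowels map to None and reset prev
            if code is not None and code != prev:
                digits.append(code)
            prev = code
        return (word[0].upper() + "".join(digits) + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))  # R163 R163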

我为君王 2024-12-31 01:32:43

Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves, and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc.
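A minimal end-to-end sketch of that idea, assuming scikit-learn for TF-IDF and cosine similarity and NLTK for sentence splitting (names like rank_sentences are just illustrative):

    # pip install scikit-learn nltk  (plus nltk.download('punkt') once)
    import numpy as np
    from nltk.tokenize import sent_tokenize
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_sentences(primary_doc, other_docs):
        """Rank primary_doc's sentences by mean TF-IDF cosine
        similarity against the other documents."""
        sentences = sent_tokenize(primary_doc)
        vectorizer = TfidfVectorizer(stop_words="english")
        # Fit on sentences + whole documents so both share one vocabulary
        matrix = vectorizer.fit_transform(sentences + other_docs)
        sent_vecs = matrix[:len(sentences)]
        doc_vecs = matrix[len(sentences):]
        scores = cosine_similarity(sent_vecs, doc_vecs).mean(axis=1)
        order = np.argsort(scores)[::-1]  # highest mean similarity first
        return [(sentences[i], float(scores[i])) for i in order]

    # Hypothetical usage with five documents:
    # ranked = rank_sentences(doc1_text, [doc2, doc3, doc4, doc5])
    # for sentence, score in ranked:
    #     print(score, sentence)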
