N-gram sentence similarity measured with cosine similarity

Published 2024-09-29 05:24:39


I have been working on a project about sentence similarity. I know it has been asked many times on SO, but I just want to know whether my problem can be solved with the method I am using, the way I am using it, or whether I should change my approach. Roughly speaking, the system is supposed to split all the sentences of an article and find similar sentences among the other articles that are fed to the system.

I am using cosine similarity with tf-idf weights, and this is how I did it:

1- First, I split all the articles into sentences, then I generate trigrams for each sentence and sort them (should I?).

2- I compute the tf-idf weights of trigrams and create vectors for all sentences.

3- I calculate the dot product and the magnitudes of the original sentence's vector and the vector of the sentence to be compared, then compute the cosine similarity.
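A minimal sketch of the pipeline described above, assuming whitespace tokenization and raw trigram counts as weights (the helper names `trigrams` and `cosine` are illustrative, not from the original code):

```python
import math
from collections import Counter

def trigrams(sentence):
    # Word trigrams: every run of 3 consecutive tokens.
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def cosine(a, b):
    # a, b: Counter mapping trigram -> weight.
    # Dot product over the shared trigrams, divided by the two 2-norms.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

s1 = Counter(trigrams("the quick brown fox jumps over the lazy dog"))
s2 = Counter(trigrams("the quick brown fox leaps over the lazy dog"))
print(cosine(s1, s2))  # 4 shared trigrams out of 7 each -> 4/7
```

Note that the dot product only iterates over trigrams present in both sentences, so no explicit sorting of the trigrams is ever needed.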

However, the system does not work as I expected, and I have some questions in mind.

As far as I have read about tf-idf weights, I gather they are more useful for finding similar "documents". Since I am working on sentences, I modified the algorithm a little by changing some variables in the tf and idf formulas (instead of documents, I tried to come up with sentence-based definitions).

tf = number of occurrences of trigram in sentence / number of all trigrams in sentence

idf = number of all sentences in all articles / number of sentences where trigram appears
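The two sentence-based definitions above can be sketched directly in code. This follows the question's formulas exactly; note that the idf here is a plain ratio, whereas standard tf-idf damps it with a logarithm, i.e. `log(N / sf)` (the function name `tfidf_vector` is illustrative):

```python
from collections import Counter

def trigrams(sentence):
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tfidf_vector(sentence, all_sentences):
    grams = trigrams(sentence)
    counts = Counter(grams)
    n = len(grams)  # number of all trigrams in the sentence
    total = len(all_sentences)
    vec = {}
    for g, c in counts.items():
        tf = c / n  # occurrences of trigram / all trigrams in sentence
        # sentence frequency: number of sentences where the trigram appears
        sf = sum(1 for s in all_sentences if g in set(trigrams(s)))
        idf = total / sf  # as defined in the question; standard: log(total / sf)
        vec[g] = tf * idf
    return vec
```

With this definition a trigram shared by many sentences gets a low weight, and a trigram unique to one sentence gets the highest weight, which is the intended tf-idf behavior.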

Do you think it is ok to use such a definition for this problem?

Another one is that I saw normalization mentioned many times when calculating the cosine similarity. I am guessing this is important because the trigram vectors might not be the same size (which they rarely are in my case). If one trigram vector has size x and the other x+1, I treat the first vector as if it had size x+1, with the last value being 0. Is this what is meant by normalization? If not, how do I do the normalization?
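To make the distinction in the paragraph above concrete: padding the shorter vector with zeros only makes the two vectors comparable dimension-wise; "normalization" usually means dividing each vector by its 2-norm so that it has unit length. A small sketch of both steps (helper names are illustrative):

```python
import math
from collections import Counter

def to_aligned_vectors(a, b):
    # Fix one common order over the union of trigrams;
    # a trigram missing from a sentence contributes 0.
    vocab = sorted(set(a) | set(b))
    return [a.get(t, 0) for t in vocab], [b.get(t, 0) for t in vocab]

def normalize(v):
    # Length normalization: divide by the 2-norm so the result has unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

a = Counter({("the", "quick", "brown"): 2, ("quick", "brown", "fox"): 1})
b = Counter({("quick", "brown", "fox"): 3})
va, vb = to_aligned_vectors(a, b)
# After normalization, cosine similarity is just the dot product.
cos = sum(x * y for x, y in zip(normalize(va), normalize(vb)))
```

So the zero-padding described in the question is alignment, not normalization; the division by the norms is what the cosine formula requires.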

Besides these, if I have chosen the wrong algorithm, what else can be used for such a problem (preferably with an n-gram approach)?

Thank you in advance.


Answers (1)

Anonymous · 2024-10-06 05:24:39


I am not sure why you are sorting the trigrams for every sentence. All you need to care about when computing cosine similarity is whether the same trigram occurred in the two sentences, and with what frequencies. Conceptually, you define one fixed, common order over all possible trigrams. Remember that this order has to be the same for all sentences. If the number of possible trigrams is N, then for each sentence you obtain a vector of dimensionality N. If a certain trigram does not occur, you set the corresponding value in the vector to zero. You don't really need to store the zeros, but you have to account for them when you define the dot product.

Having said that, trigrams are not a good choice, because exact matches become a lot sparser. For high k you will get better results from bags of k consecutive words rather than from k-grams. Note that ordering does not matter inside a bag; it's a set. You are using k=3 k-grams, which seems on the high side, especially for sentences. Either drop down to bigrams, or use bags of different lengths starting from 1. Preferably use both.
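A sketch of the "bags of k consecutive words" idea from the paragraph above, assuming whitespace tokenization (the function name `word_bags` is illustrative):

```python
def word_bags(sentence, k):
    # Bags (sets) of k consecutive words: order inside each bag is ignored,
    # so a local word swap still produces a matching bag.
    tokens = sentence.lower().split()
    return {frozenset(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

b1 = word_bags("the quick brown fox", 2)
b2 = word_bags("the brown quick fox", 2)
# The swapped pair still matches as a bag, while the ordered
# bigrams of these two sentences share nothing.
print(b1 & b2)  # {frozenset({'quick', 'brown'})}
```

Because each bag is a `frozenset`, "quick brown" and "brown quick" hash to the same element, which is exactly why bags match more often than ordered n-grams.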

I am sure you have noticed that sentences which share no exact trigram have 0 similarity in your method. Bags of k words
will alleviate the situation somewhat, but not solve it completely, because you still need the sentences to share actual words; two sentences may be similar without using the same words. There are a couple of ways to fix this: either use LSI (Latent Semantic Indexing), or cluster the words and use the cluster labels to define your cosine similarity.

In order to compute the cosine similarity between vectors x and y, you compute the dot product and divide by the norms of x and y.
The 2-norm of the vector x is the square root of the sum of its squared components. However, you should also try your algorithm without any normalization, for comparison. It usually works fine, because you already account for the relative sizes of the sentences when you compute the term frequencies (tf).
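A worked numeric example of the formula just described (the values are illustrative):

```python
import math

x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]

dot = sum(a * b for a, b in zip(x, y))     # 1*2 + 2*1 + 0*1 = 4
norm_x = math.sqrt(sum(a * a for a in x))  # 2-norm of x: sqrt(5)
norm_y = math.sqrt(sum(b * b for b in y))  # 2-norm of y: sqrt(6)
cos = dot / (norm_x * norm_y)              # 4 / sqrt(30) ≈ 0.730
```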

Hope this helps.
