Text similarity algorithms
I'm doing a Java project where I have to make a text similarity program. I want it to take two text documents, compare them with each other, and report how similar they are.
Later I'll hook in an existing database that can find synonyms for words, and go through the text to see whether one of the document writers just swapped words for synonyms while the text is otherwise identical. Same thing with moving paragraphs up or down.
Yes, it's essentially a plagiarism detector...
I'd like to hear from you what kind of algorithms you would recommend.
I've found Levenshtein and cosine similarity by looking here and elsewhere; both seem to be mentioned a lot. Hamming distance is another one my teacher told me about.
I have some questions about them, since I'm not really getting the Wikipedia articles. Could someone explain these things to me?
Levenshtein: this algorithm transforms one word into another by substituting, adding, and deleting characters, and the number of edits tells you how close the two words are. But how can that be used on a whole text file? I can see how it works on a single word, but not on a sentence or an entire text document.
Cosine: it measures the similarity between two vectors as the cosine of the angle between them. What I don't understand is how two texts become two vectors, and what happens to the words/sentences in them.
Hamming: this distance seems to work better than Levenshtein, but it only works on strings of equal length. How is it relevant when two documents, or even the sentences in them, aren't equal-length strings?
Wikipedia should make sense, but it doesn't. I'm sorry if these questions sound too stupid, but this is getting me down, and I think there are people here quite capable of explaining it so that even newcomers to the field can get it.
Thanks for your time.
3 Answers
Levenshtein: in theory you could use it on a whole text file, but it's really not very suitable for the task. It's really intended for single words or (at most) a short phrase.
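For reference, the edit distance the question describes can be sketched as below; the class and method names are my own, and this is a minimal version of the standard dynamic-programming formulation, not anything tuned for long texts:

```java
// Minimal Levenshtein distance sketch: counts the single-character
// substitutions, insertions and deletions needed to turn one string
// into the other, using the classic dynamic-programming table.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,         // deletion
                        d[i][j - 1] + 1),        // insertion
                        d[i - 1][j - 1] + cost); // substitution (or match)
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The quadratic table is exactly why it's impractical on whole documents: two 10,000-character texts already mean a hundred million cells.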
Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
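To make the "documents become vectors" step concrete, here is a sketch (class and method names are mine): each distinct word is a dimension, its count in a document is the coordinate along that dimension, and the similarity is the cosine of the angle between the two count vectors — 1.0 for identical word distributions, 0.0 for no words in common:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of cosine similarity over raw word counts.
public class CosineSimilarity {
    // Turn a text into a word -> count map (a sparse vector).
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // cos(theta) = (a . b) / (|a| * |b|) over the combined vocabulary.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocab = new HashSet<>(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String w : vocab) {
            int x = a.getOrDefault(w, 0);
            int y = b.getOrDefault(w, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double sim = cosine(wordCounts("the cat sat on the mat"),
                            wordCounts("the cat sat on a mat"));
        System.out.println(sim); // close to 1 for near-identical texts
    }
}
```

Note that word order disappears entirely here — that's the "bag of words" simplification, which is also why paragraph reordering doesn't fool this measure.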
I've never used Hamming distance for this purpose, so I can't say much about it.
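To illustrate the equal-length restriction the question asked about: Hamming distance compares position by position, with no notion of insertion or deletion, so it is simply undefined for strings of different lengths. A minimal sketch (class name is mine):

```java
// Hamming distance: count the positions at which two equal-length
// strings differ. Unlike Levenshtein, there are no insertions or
// deletions, which is why the lengths must match.
public class Hamming {
    public static int distance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) diff++;
        }
        return diff;
    }
}
```

A single inserted character shifts every later position, so Hamming distance explodes on texts that are merely offset — another reason it's a poor fit for document comparison.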
I would add TF-IDF (Term Frequency * Inverse Document Frequency) to the list. It's fairly similar to cosine distance, but 1) tends to do a better job on shorter documents, and 2) does a better job of taking into account which words are extremely common in the entire corpus, rather than just the ones that happen to be common to two particular documents.
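As a rough illustration of the TF-IDF idea — this is the common textbook formulation, not necessarily the exact variant meant above, and the names are mine: a word's weight in a document is its frequency there (TF) times log(N / df), where df is how many of the N corpus documents contain the word. A word that appears in every document gets weight ~0, which is how corpus-wide commonness is discounted:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of TF-IDF weighting over word-count maps.
public class TfIdf {
    // docCounts: word -> count for one document; corpus: all documents' counts.
    public static Map<String, Double> weights(Map<String, Integer> docCounts,
                                              List<Map<String, Integer>> corpus) {
        Map<String, Double> w = new HashMap<>();
        int n = corpus.size();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            // df: number of corpus documents containing this word.
            long df = corpus.stream()
                            .filter(d -> d.containsKey(e.getKey()))
                            .count();
            w.put(e.getKey(), e.getValue() * Math.log((double) n / df));
        }
        return w;
    }
}
```

The resulting weight vectors can then be fed into the same cosine computation as raw counts, which is why the two measures feel so similar in practice.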
One final note: for any of these to produce useful results, you pretty much need to screen out stop words before you try to compute the degree of similarity (though TF-IDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done this, I used Porter's stemming algorithm.
For your purposes, you probably want to use what I've dubbed an inverted thesaurus: one that lets you look up a word and, for each word, substitute a single canonical word for that meaning. I tried this on one project and didn't find it as useful as expected, but it sounds like it would probably be considerably more useful for your project.
The basic idea of comparing the similarity of two documents, which is a topic in information retrieval, is to extract a fingerprint from each and judge whether they share information based on those fingerprints.
Just a hint: the paper "Winnowing: Local Algorithms for Document Fingerprinting" may be a good choice and a good starting point for your problem.
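As a rough sketch of the winnowing idea from that paper — an approximation, not the paper's exact algorithm, with `String.hashCode` standing in for a proper rolling hash and all names mine: hash every k-gram of the text, then in each sliding window of w consecutive hashes keep only the minimum. Documents that share long passages end up sharing many fingerprints, regardless of where the passage sits:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified winnowing-style document fingerprinting.
public class Winnowing {
    public static Set<Integer> fingerprints(String text, int k, int w) {
        // 1. Hash every k-gram (contiguous substring of length k).
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= text.length(); i++) {
            hashes.add(text.substring(i, i + k).hashCode());
        }
        // 2. In every window of w consecutive hashes, keep the minimum.
        Set<Integer> prints = new HashSet<>();
        for (int i = 0; i + w <= hashes.size(); i++) {
            int min = hashes.get(i);
            for (int j = i + 1; j < i + w; j++) {
                min = Math.min(min, hashes.get(j));
            }
            prints.add(min);
        }
        return prints;
    }
}
```

Comparing two documents then reduces to intersecting two small hash sets, which scales far better than any pairwise edit distance.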
Consider the example on Wikipedia for Levenshtein distance:
Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".
Compare those results and flag the students with the lowest distance scores.
In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism).
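A minimal sketch of longest common substring via dynamic programming (class and method names are mine): copied passages tend to show up as unusually long runs of characters shared verbatim by the two texts, which is what makes this a useful plagiarism signal:

```java
// Longest common substring: the longest run of characters that appears
// verbatim in both strings. len[i][j] holds the length of the common
// run ending at a[i-1] and b[j-1].
public class LongestCommonSubstring {
    public static String find(String a, String b) {
        int best = 0, endA = 0;
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endA = i; // remember where the best run ends in a
                    }
                }
            }
        }
        return a.substring(endA - best, endA);
    }
}
```

The length of the result relative to the document length is one easy ingredient for a combined compareSimilarities score.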