Percentage similarity analysis (Java)
I have the following situation:
String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
String b = "Web Crawler computer program browses the World Wide Web";
Is there any idea or standard algorithm to calculate a percentage of similarity?
For instance, in the case above, the similarity as estimated by a human reader should be 90%+.
My idea is to tokenize both strings and compare the number of matching tokens, something like
(7 tokens / 10 tokens) * 100. But, of course, this method is not effective at all. Comparing the number of matching characters also seems ineffective...
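For reference, that token-counting idea can be made concrete with a short sketch (the whitespace tokenization and case-folding here are my own assumptions, not part of the question):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TokenMatch {
    // Percentage of tokens in b that also appear in a (case-insensitive).
    static double tokenMatchPercent(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        String[] tokensB = b.toLowerCase().split("\\s+");
        int matched = 0;
        for (String t : tokensB) {
            if (tokensA.contains(t)) matched++;
        }
        return 100.0 * matched / tokensB.length;
    }

    public static void main(String[] args) {
        String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
        String b = "Web Crawler computer program browses the World Wide Web";
        System.out.println(tokenMatchPercent(a, b)); // every token of b occurs in a
    }
}
```

Note that for the example strings this already reports 100%, which illustrates the question's point: raw token counting ignores word order and duplicates, so it overstates similarity.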
Can anyone give some guidelines?
Above is part of my project, Plagiarism Analyzer.
Hence, the words matched will be exactly the same, without any synonyms.
The only thing that matters in this case is how to calculate a reasonably accurate percentage of similarity.
Thanks a lot for any help.
6 Answers
As Konrad pointed out, your question depends heavily on what you mean by "similar".
In general, I would say the following guidelines should be of use:
You can refine the technique to include differences between word forms, sentence word order, synonym lists, etc. Although you'll never get perfect results, you have a lot of tweaking possibilities, and I believe that in general you can get quite valuable measures of similarity.
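As one concrete instance of such a word-level measure, here is a minimal sketch of the Sørensen–Dice coefficient over token sets (the lowercasing and letters-only tokenization are illustrative assumptions; a refined version would add stemming and synonym handling as described above):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DiceSimilarity {
    // Sørensen–Dice coefficient over word sets: 2*|A ∩ B| / (|A| + |B|).
    static double dice(String a, String b) {
        Set<String> setA = tokenize(a);
        Set<String> setB = tokenize(b);
        Set<String> common = new HashSet<>(setA);
        common.retainAll(setB);
        return 2.0 * common.size() / (setA.size() + setB.size());
    }

    private static Set<String> tokenize(String s) {
        Set<String> set = new HashSet<>(Arrays.asList(s.toLowerCase().split("[^a-z]+")));
        set.remove(""); // split can yield a leading empty token
        return set;
    }

    public static void main(String[] args) {
        System.out.println(dice("Web crawler program", "Web crawler")); // 2*2 / (3+2) = 0.8
    }
}
```

The result is already in [0, 1], so multiplying by 100 gives a similarity percentage.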
That depends on your idea of similarity. Formally, you need to define a metric of what you consider "similar" strings in order to apply statistics to them. Usually, this is done via the hypothetical question: "how likely is it that the first string is a modified version of the second string, into which errors (e.g. by typing it) were introduced?"
A very simple yet effective measure of such similarity (or rather, its inverse) is the edit distance of two strings, which can be computed using dynamic programming in O(nm) time in general, where n and m are the lengths of the strings.
Depending on your usage, more elaborate measures (or completely unrelated ones, such as the soundex metric) might be required.
In your case, if you straightforwardly apply a token match (i.e. mere word count) you will never get a > 90% similarity. Getting such a high similarity in a meaningful way would require advanced semantic analysis. If you get this done, please publish the paper, because this is as yet a largely unsolved problem.
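The edit distance mentioned in this answer can be computed with the standard dynamic-programming table; a minimal self-contained sketch:

```java
public class EditDistance {
    // Classic Levenshtein distance via dynamic programming.
    // d[i][j] = edit distance between the first i chars of s
    // and the first j chars of t; runs in O(n*m) time and space.
    static int levenshtein(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of s
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of t
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```

A derived percentage such as `100 * (1 - distance / maxLength)` is one common way to normalize it, though as the answer notes, character-level distance is a poor fit for whole-sentence plagiarism detection.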
I second what Konrad Rudolph has already said.
Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.
Given what you seem to be looking for, I recommend that you apply some of the standard text processing methods. All of these have potential pitfalls, so I list them in order of both application and difficulty to do well.
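As a starting point for such preprocessing, here is a minimal sketch of a normalization pass (lowercasing, stripping punctuation, dropping stop words); the tiny stop-word list is purely illustrative, and a real system would use a much larger one plus a stemmer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Preprocess {
    // A small illustrative stop-word list; real systems use larger ones.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "that", "of");

    // Normalize text: lowercase, strip punctuation, drop stop words.
    static List<String> preprocess(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("A Web crawler is a computer program"));
        // [web, crawler, computer, program]
    }
}
```

Running a similarity measure over the cleaned token lists, rather than the raw strings, usually brings the score much closer to human intuition.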
The problem with this question is: the similarity may be either a humanized similarity (as you say, "+- 90% similarity") or a statistical similarity (Konrad Rudolph's answer).
The human-similarity can never be easily calculated: for instance these three words
The statistical similarity is very low, while to a human they are actually quite similar. Thus: it will be hard to solve this problem, and the only thing I can point you to is Bayesian filtering or Artificial Intelligence with Bayesian networks.
One common measure is the Levenshtein distance, a special case of the string edit distance. It is also included in the Apache Commons Lang StringUtils library.
The Longest Common Subsequence is well known as a string dissimilarity metric, and can be computed with dynamic programming.
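A minimal sketch of the dynamic-programming computation of the LCS length (the example strings are my own):

```java
public class Lcs {
    // Length of the longest common subsequence of s and t.
    // dp[i][j] = LCS length of the first i chars of s and first j chars of t.
    static int lcsLength(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (s.charAt(i - 1) == t.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1; // extend a common subsequence
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[n][m];
    }

    public static void main(String[] args) {
        System.out.println(lcsLength("AGGTAB", "GXTXAYB")); // 4 ("GTAB")
    }
}
```

For a similarity percentage, one option is `100.0 * lcsLength(a, b) / Math.max(a.length(), b.length())`; applied to word tokens instead of characters, it also rewards matching word order, which plain token counting ignores.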