Percentage Similarity Analysis (Java)

Posted 2024-08-24 06:46:44 · 399 characters · 12 views · 0 comments

I have the following situation:

String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
String b = "Web Crawler computer program browses the World Wide Web";

Is there any idea or standard algorithm to calculate the percentage of similarity?

For instance, in the case above, the similarity estimated by manual inspection should be 90%+.

My idea is to tokenize both Strings and compare the number of matching tokens, something like (7 tokens / 10 tokens) * 100. But, of course, this method is not effective at all. Comparing the number of matching characters also seems ineffective...

Can anyone give some guidelines?

The above is part of my project, a Plagiarism Analyzer.

Hence, matched words must be exactly the same, without any synonyms.

The only question in this case is how to calculate a reasonably accurate percentage of similarity.

Thanks a lot for any help.
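For reference, the token-matching idea described above can be sketched as follows. This is a minimal illustration, not a recommendation; the class name, the `\W+` tokenizer, and the choice to divide by the larger token set are my own assumptions:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TokenOverlap {

    /** Percentage of unique tokens shared, relative to the larger token set. */
    static double overlapPercent(String a, String b) {
        Set<String> tokensA = tokenize(a);
        Set<String> tokensB = tokenize(b);
        Set<String> common = new HashSet<>(tokensA);
        common.retainAll(tokensB);                 // set intersection
        int larger = Math.max(tokensA.size(), tokensB.size());
        return larger == 0 ? 0.0 : 100.0 * common.size() / larger;
    }

    /** Lowercase and split on non-word characters. */
    static Set<String> tokenize(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\W+")));
    }
}
```

As the question notes, this naive measure undercounts perceived similarity (for the two crawler sentences it yields well under 90%), which is exactly the problem the answers below address.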


Comments (6)

離殇 2024-08-31 06:46:44

As Konrad pointed out, your question depends heavily on what you mean by "similar".
In general, I would say the following guidelines should be of use:

  • normalize the input by reducing each word to its base form and lowercasing it
  • use a word frequency list (easily obtainable on the web) and make a word's "similarity relevance" inversely proportional to its position on the frequency list
  • calculate the total sentence similarity as the aggregated similarity of the words appearing in both sentences divided by the total similarity relevance of the sentences

You can refine the technique to include differences between word forms, sentence word order, synonym lists, etc. Although you'll never get perfect results, you have a lot of tweaking possibilities, and I believe that in general you can get quite valuable measures of similarity.
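One possible reading of these guidelines in Java is sketched below. The tiny inline frequency list is a placeholder I made up for illustration (a real one would be loaded from a corpus), and no stemming is done, so treat this only as a shape of the computation:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class WeightedSimilarity {

    // Toy frequency-ranked list (most frequent first); a real list would be loaded from a file.
    static final List<String> FREQUENCY_RANK =
            List.of("the", "a", "is", "that", "web", "computer", "program");

    /** Relevance inversely proportional to frequency rank; unknown (rare) words get full weight. */
    static double relevance(String word) {
        int rank = FREQUENCY_RANK.indexOf(word);
        return rank < 0 ? 1.0 : 1.0 / (rank + 1);
    }

    /** Aggregated relevance of shared words divided by total relevance of all words. */
    static double similarity(String a, String b) {
        Set<String> wa = tokens(a), wb = tokens(b);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        double shared = 0, total = 0;
        for (String w : union) {
            double rel = relevance(w);
            total += rel;
            if (wa.contains(w) && wb.contains(w)) shared += rel;
        }
        return total == 0 ? 0 : shared / total;
    }

    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\W+")));
    }
}
```

Because common words like "the" carry little weight, two sentences sharing only rare content words score higher than ones sharing only function words, which matches the intent of the answer.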

老娘不死你永远是小三 2024-08-31 06:46:44

That depends on your idea of similarity. Formally, you need to define a metric of what you consider "similar" strings in order to apply statistics to them. Usually, this is done via the hypothetical question: "how likely is it that the second string is a modified version of the first string into which errors (e.g. by mistyping) were introduced?"

A very simple yet effective measure of such similarity (or rather, its inverse) is the edit distance of two strings, which can be computed using dynamic programming and generally takes time O(nm), where n and m are the lengths of the strings.

Depending on your usage, more elaborate measures (or completely unrelated ones, such as the soundex metric) might be required.

In your case, if you straightforwardly apply token matching (i.e. mere word counting) you will never get > 90% similarity. Getting such a high similarity in a meaningful way would require advanced semantic analysis. If you get this done, please publish the paper, because this is still a largely unsolved problem.
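The O(nm) dynamic-programming edit distance mentioned here is the standard Levenshtein recurrence; a self-contained Java version might look like this (class and method names are my own):

```java
public class EditDistance {

    /** Levenshtein distance via dynamic programming: O(n*m) time, O(n*m) space. */
    static int levenshtein(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // delete all of s's prefix
        for (int j = 0; j <= m; j++) d[0][j] = j;   // insert all of t's prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution/match
            }
        }
        return d[n][m];
    }
}
```

To turn the distance into a percentage one could, for instance, compute `100.0 * (1 - distance / (double) Math.max(n, m))`, though as the answer notes, character-level distance alone will not reproduce a human's 90%+ judgment.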

面如桃花 2024-08-31 06:46:44

I second what Konrad Rudolph has already said.

Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.

Given what you seem to be looking for, I recommend that you apply some standard text-processing methods. All of these have potential downfalls, so I list them in order of both application and difficulty to do well:

  1. Sentence splitting: figure out your units of comparison.
  2. Stop-word removal: take out "a", "an", "the", "of", etc.
  3. Bag-of-words percentage: what percentage of the overall words match, independent of ordering.
  4. (Much more aggressive) you could try synonym expansion, which counts synonyms as matched words.
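Steps 2 and 3 of this pipeline can be sketched as below. The stop-word set is a tiny placeholder of my own; a real analyzer would use a fuller list, and sentence splitting (step 1) is assumed to have happened already:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class BagOfWords {

    // Minimal illustrative stop-word list; extend for real use.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "is", "that");

    /** Steps 2-3: drop stop words, then measure order-independent word overlap. */
    static double bagOfWordsPercent(String a, String b) {
        Set<String> wa = contentWords(a), wb = contentWords(b);
        Set<String> common = new HashSet<>(wa);
        common.retainAll(wb);                    // words in both sentences
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);                        // words in either sentence
        return union.isEmpty() ? 0.0 : 100.0 * common.size() / union.size();
    }

    /** Lowercase, tokenize, and remove stop words. */
    static Set<String> contentWords(String s) {
        return Arrays.stream(s.toLowerCase(Locale.ROOT).split("\\W+"))
                     .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                     .collect(Collectors.toSet());
    }
}
```

On the two crawler sentences from the question, removing stop words raises the overlap noticeably compared to raw token matching, because the unmatched words are mostly function words.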
烟花易冷人易散 2024-08-31 06:46:44

The problem with this question is that the similarity may be either a human similarity (as you say, "90%+ similarity") or a statistical similarity (Konrad Rudolph's answer).

Human similarity can never be easily calculated: take, for instance, these three word pairs

cellphone car message

mobile automobile post

The statistical similarity is very low, while the pairs are actually quite similar. Thus it will be hard to solve this problem, and the only thing I can point you to is Bayesian filtering, or Artificial Intelligence with Bayesian networks.

走过海棠暮 2024-08-31 06:46:44

One common measure is the Levenshtein distance, a special case of the string edit distance. It is also included in the Apache Commons Lang `StringUtils` class (`getLevenshteinDistance`).

最佳男配角 2024-08-31 06:46:44

The Longest Common Subsequence is a well-known string dissimilarity metric, and it can be implemented with dynamic programming.
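A self-contained dynamic-programming implementation of the LCS length might look like this (class and method names are my own choices):

```java
public class Lcs {

    /** Length of the longest common subsequence of s and t: O(n*m) dynamic programming. */
    static int lcsLength(String s, String t) {
        int n = s.length(), m = t.length();
        int[][] dp = new int[n + 1][m + 1];     // dp[i][j] = LCS of s[0..i) and t[0..j)
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                dp[i][j] = s.charAt(i - 1) == t.charAt(j - 1)
                         ? dp[i - 1][j - 1] + 1                   // characters match
                         : Math.max(dp[i - 1][j], dp[i][j - 1]);  // skip one character
            }
        }
        return dp[n][m];
    }
}
```

A similarity percentage could then be derived as, e.g., `100.0 * lcsLength(a, b) / Math.max(a.length(), b.length())`; applying the same recurrence over word tokens instead of characters makes it order-sensitive at the word level, which suits plagiarism detection.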
