How do I compare almost-similar strings in Java? (string distance measure)

Posted on 2024-08-18 22:51:49

Comments (5)

想你只要分分秒秒 2024-08-25 22:51:49

The following Java libraries offer several comparison algorithms (Levenshtein, Jaro-Winkler, ...):

  1. Apache Commons Lang 3: https://commons.apache.org/proper/commons-lang/
  2. Simmetrics: http://sourceforge.net/projects/simmetrics/

Both libraries have Javadoc documentation (Apache Commons Lang Javadoc, Simmetrics Javadoc).

// Usage of Apache Commons Lang 3
// Note: getJaroWinklerDistance is deprecated in later Commons Lang 3 releases
// in favor of Apache Commons Text.
import org.apache.commons.lang3.StringUtils;

public double compareStrings(String stringA, String stringB) {
    return StringUtils.getJaroWinklerDistance(stringA, stringB);
}

// Usage of Simmetrics
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public double compareStrings(String stringA, String stringB) {
    JaroWinkler algorithm = new JaroWinkler();
    return algorithm.getSimilarity(stringA, stringB);
}
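
For readers who want to see what a Jaro-Winkler score measures without pulling in either library, here is a minimal dependency-free sketch. It assumes the textbook definition (match window of half the longer length minus one, prefix scaling factor 0.1, common prefix capped at 4 characters). The class and method names are made up for illustration, and library implementations may differ in details such as rounding or boost thresholds.

```java
// Hypothetical helper class, not part of Commons Lang or Simmetrics.
final class JaroWinklerSketch {

    /** Jaro similarity in [0, 1]; 1.0 means the strings are identical. */
    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        // Characters only count as matching if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / s1.length() + m / s2.length() + (m - transpositions) / m) / 3.0;
    }

    /** Jaro-Winkler: boosts the Jaro score for strings sharing a common prefix. */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

With this sketch, `JaroWinklerSketch.jaroWinkler("MARTHA", "MARHTA")` comes out at roughly 0.961: all six characters match, one pair is transposed, and the shared prefix "MAR" adds a small boost.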

我的奇迹 2024-08-25 22:51:49

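The Levenshtein distance is a measure of how similar two strings are. More precisely, it is the number of single-character edits (insertions, deletions, or substitutions) needed to make them identical.

The algorithm is available as pseudocode on Wikipedia. Converting it to Java shouldn't be much of a problem, but it isn't built into the base class library.

Wikipedia also lists further algorithms that measure string similarity.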
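
As a rough illustration of that conversion, the standard dynamic-programming formulation might look like this in Java. It is a sketch rather than a tuned implementation: the class name is made up, it keeps only two rows of the table to save memory, and it compares UTF-16 `char` values, so characters outside the Basic Multilingual Plane count as two units.

```java
// Hypothetical helper class illustrating the textbook algorithm.
final class LevenshteinSketch {

    /**
     * Returns the minimum number of single-character insertions,
     * deletions, and substitutions needed to turn a into b.
     */
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning a[0..i) into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two rows instead of allocating a full table.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The running time is proportional to the product of the two string lengths, which is fine for short identifiers or names but can be slow for long documents.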

轻许诺言 2024-08-25 22:51:49

Yes, that's a good metric. You could use StringUtils.getLevenshteinDistance() from Apache Commons Lang.

凉月流沐 2024-08-25 22:51:49

You can find implementations of Levenshtein and other string similarity/distance measures at
https://github.com/tdebatty/java-string-similarity

If your project uses Maven, installation is as simple as

<dependency>
  <groupId>info.debatty</groupId>
  <artifactId>java-string-similarity</artifactId>
  <!-- consider pinning a concrete version for reproducible builds -->
  <version>RELEASE</version>
</dependency>

Then, to use Levenshtein for example:

import info.debatty.java.stringsimilarity.*;

public class MyApp {

  public static void main (String[] args) {
    Levenshtein l = new Levenshtein();

    // The two strings differ by a single substitution ('s' -> '$').
    System.out.println(l.distance("My string", "My $tring"));
  }
}

不交电费瞎发啥光 2024-08-25 22:51:49

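Shameless plug, but I also wrote a library:

https://github.com/vickumar1981/stringdistance

It has all of these functions, plus a few for phonetic similarity. Those check whether one word "sounds like" another and return true or false, unlike the other fuzzy similarity measures, which return a number between 0 and 1.

It also includes DNA sequencing algorithms such as Smith-Waterman and Needleman-Wunsch, which are generalized versions of Levenshtein.

In the near future I plan to make this work with any array, not just strings (arrays of characters).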
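
To make the relationship between Needleman-Wunsch and Levenshtein concrete, here is a small sketch of the Needleman-Wunsch global alignment score with a linear gap penalty. It is not taken from the library above. The class name, method name, and the `match`, `mismatch`, and `gap` parameters are assumptions chosen for illustration.

```java
// Hypothetical helper class; not the stringdistance library's API.
final class NeedlemanWunschSketch {

    /**
     * Returns the best global alignment score of a and b, where aligned
     * equal characters score `match`, aligned unequal characters score
     * `mismatch`, and each character aligned against a gap scores `gap`.
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Aligning the empty prefix of a with b[0..j) needs j gaps.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * gap;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * gap;
            for (int j = 1; j <= b.length(); j++) {
                int diagonal = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int gapInB = prev[j] + gap;      // a[i-1] aligned against a gap
                int gapInA = curr[j - 1] + gap;  // b[j-1] aligned against a gap
                curr[j] = Math.max(diagonal, Math.max(gapInB, gapInA));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With `match = 0`, `mismatch = -1`, and `gap = -1`, the score is exactly the negated Levenshtein distance, so `NeedlemanWunschSketch.score("kitten", "sitting", 0, -1, -1)` returns -3. Other scoring values let the alignment reward matches or penalize gaps more heavily, which is what makes it the more general algorithm.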

