I want to compare several strings with each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to other strings? For example:
- "The quick fox jumped" -> "The fox jumped"
- "The quick fox jumped" -> "The fox"
This comparison would report that the first pair is more similar than the second.
I guess I need a method such as:
double similarityIndex(String s1, String s2)
Is there such a thing somewhere?
EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automated way to find which MS Project entries are similar to the entries in the legacy system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.
Comments (12)
You can also use the Z algorithm to measure similarity within a string. See https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/
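For readers unfamiliar with it, here is a minimal sketch of the Z algorithm in Java. It assumes the linked challenge is the usual HackerRank "String Similarity" task, where the score is the sum of the common-prefix lengths between a string and each of its own suffixes. The class and method names (`ZSimilarity`, `zArray`, `suffixSimilarity`) are made up for illustration. Note that this scores a string against itself, so it is not a drop-in `similarityIndex(s1, s2)`.

```java
class ZSimilarity {
    // z[i] = length of the longest common prefix of s and the suffix s.substring(i).
    static int[] zArray(String s) {
        int n = s.length();
        int[] z = new int[n];
        if (n == 0) {
            return z;
        }
        z[0] = n; // the whole string matches itself
        int left = 0, right = 0; // [left, right) is the rightmost window known to match a prefix
        for (int i = 1; i < n; i++) {
            if (i < right) {
                // Reuse what was already learned inside the current window.
                z[i] = Math.min(right - i, z[i - left]);
            }
            // Extend the match character by character.
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
                z[i]++;
            }
            if (i + z[i] > right) {
                left = i;
                right = i + z[i];
            }
        }
        return z;
    }

    // Sum of the common-prefix lengths between s and every suffix of s.
    static long suffixSimilarity(String s) {
        long total = 0;
        for (int value : zArray(s)) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(suffixSimilarity("ababaa")); // prints 11 (6 + 0 + 3 + 0 + 1 + 1)
    }
}
```

The window bookkeeping is what keeps the whole computation linear in the length of the string.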
A common way to express the similarity between two strings as a value from 0% to 100%, used in many libraries, is to measure how much (in %) you would have to change the longer string to turn it into the shorter one.
Computing the editDistance():
The editDistance() function is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a specific scenario better. The most common is the Levenshtein distance algorithm (for very large strings, other algorithms are likely to perform better). Options for calculating the edit distance include:
apply(CharSequence left, CharSequence right)
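The percentage approach described in this answer can be sketched roughly as follows. This is not the answer's original code. It assumes similarity is defined as (length of the longer string minus the edit distance) divided by the length of the longer string, and it uses a hand-written Levenshtein distance. The class name `StringSimilarity` and the method names are chosen for illustration.

```java
class StringSimilarity {
    // Returns a value between 0.0 and 1.0; 1.0 means the strings are identical.
    static double similarity(String s1, String s2) {
        String longer = s1.length() >= s2.length() ? s1 : s2;
        String shorter = s1.length() >= s2.length() ? s2 : s1;
        int longerLength = longer.length();
        if (longerLength == 0) {
            return 1.0; // both strings are empty
        }
        // The distance is computed once and normalised by the longer length.
        int distance = editDistance(longer, shorter);
        return (longerLength - distance) / (double) longerLength;
    }

    // Levenshtein distance, keeping only two rows of the dynamic-programming table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String base = "The quick fox jumped";
        System.out.println(similarity(base, "The fox jumped"));
        System.out.println(similarity(base, "The fox"));
    }
}
```

With the strings from the question, the edit distances are 6 and 13 against a longer length of 20, so the similarities come out as 0.7 and 0.35, which matches the expectation that "The fox jumped" is the closer match. The comparison is case-sensitive; lowercasing both inputs first is a possible variation.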
Yes, there are many well-documented algorithms, such as:
A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it points to the Internet Archive).
Also check these projects:
I translated the Levenshtein distance algorithm into JavaScript:
There are indeed a lot of string similarity measures out there:
You can find explanations and Java implementations of these here:
https://github.com/tdebatty/java-string-similarity
You can achieve this using the Apache Commons Text library. Take a look at these two classes within it:
Deprecated versions of the above:
Apache Commons Java library -> getLevenshteinDistance, getFuzzyDistance
You could use the Levenshtein distance to calculate the difference between two strings.
http://en.wikipedia.org/wiki/Levenshtein_distance
Thanks to the first answerer. I think computeEditDistance(s1, s2) is calculated twice there. Because that call is time-consuming, I decided to improve the code's performance. So:
Theoretically, you can compare edit distances.
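To illustrate what comparing edit distances could look like in practice, here is a minimal sketch that picks, from a list of candidates, the one with the smallest Levenshtein distance to a target string. The class `ClosestMatch` and its methods are hypothetical names, not part of any library, and the distance function is a plain single-row implementation.

```java
import java.util.Arrays;
import java.util.List;

class ClosestMatch {
    // Levenshtein distance using a single row of the dynamic-programming table.
    static int levenshtein(String a, String b) {
        int[] row = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            row[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int diagonal = row[0]; // value of cell (i-1, j-1)
            row[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int above = row[j]; // value of cell (i-1, j), about to be overwritten
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                row[j] = Math.min(Math.min(row[j - 1] + 1, above + 1), diagonal + cost);
                diagonal = above;
            }
        }
        return row[b.length()];
    }

    // Returns the candidate with the smallest distance to target, or null if there are none.
    // On a tie, the earliest candidate in the list wins.
    static String closest(String target, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int distance = levenshtein(target, candidate);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("The fox", "The fox jumped");
        System.out.println(closest("The quick fox jumped", candidates)); // prints The fox jumped
    }
}
```

For the matching task in the question, each MS Project description would be the target and the legacy system's abbreviated descriptions the candidates. Raw distances favour candidates of similar length, so the results would still need the manual check the question mentions.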
This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.
If your strings turn into documents, this sounds like a plagiarism finder to me. Searching for that term may turn up something good.
"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it's clean and easy to port.
You can use this "Levenshtein Distance" algorithm without any library:
From here