Getting the closest string match
I need a way to compare multiple strings to a test string and return the string that most closely resembles it:
TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW
CHOICE A : THE RED COW JUMPED OVER THE GREEN CHICKEN
CHOICE B : THE RED COW JUMPED OVER THE RED COW
CHOICE C : THE RED FOX JUMPED OVER THE BROWN COW
(If I did this correctly) The closest string to the "TEST STRING" should be "CHOICE C". What is the easiest way to do this?
I plan on implementing this into multiple languages including VB.net, Lua, and JavaScript. At this point, pseudo code is acceptable. If you can provide an example for a specific language, this is appreciated too!
14 Answers
I was presented with this problem about a year ago when it came to looking up user-entered information about an oil rig in a database of miscellaneous information. The goal was to do some sort of fuzzy string search that could identify the database entry with the most common elements.
Part of the research involved implementing the Levenshtein distance algorithm, which determines how many changes must be made to a string or phrase to turn it into another string or phrase.
The implementation I came up with was relatively simple, and involved a weighted comparison of the length of the two phrases, the number of changes between each phrase, and whether each word could be found in the target entry.
The article is on a private site so I'll do my best to append the relevant contents here:
Fuzzy String Matching is the process of performing a human-like estimation of the similarity of two words or phrases. In many cases, it involves identifying words or phrases which are most similar to each other. This article describes an in-house solution to the fuzzy string matching problem and its usefulness in solving a variety of problems which can allow us to automate tasks which previously required tedious user involvement.
Introduction
The need to do fuzzy string matching originally came about while developing the Gulf of Mexico Validator tool. What existed was a database of known Gulf of Mexico oil rigs and platforms, and people buying insurance would give us some badly typed out information about their assets, and we had to match it to the database of known platforms. When there was very little information given, the best we could do was rely on an underwriter to "recognize" the one they were referring to and call up the proper information. This is where this automated solution comes in handy.
I spent a day researching methods of fuzzy string matching, and eventually stumbled upon the very useful Levenshtein distance algorithm on Wikipedia.
Implementation
After reading about the theory behind it, I implemented it and found ways to optimize it. This is what my code looks like in VBA:
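(The original VBA listing isn't reproduced in this copy. As a stand-in, here is a minimal Python sketch of the same dynamic-programming Levenshtein distance; this is my reconstruction, not the author's code.)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance: the number of single-character
    # insertions, deletions, and substitutions needed to turn a into b.
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # prev[j] = distance between the first i-1 chars of a and first j chars of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]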
Simple, speedy, and a very useful metric. Using this, I created two separate metrics for evaluating the similarity of two strings. One I call "valuePhrase" and one I call "valueWords". valuePhrase is just the Levenshtein distance between the two phrases, and valueWords splits the string into individual words, based on delimiters such as spaces, dashes, and anything else you'd like, and compares each word to each other word, summing up the shortest Levenshtein distance connecting any two words. Essentially, it measures whether the information in one 'phrase' is really contained in another, just as a word-wise permutation. I spent a few days as a side project coming up with the most efficient way possible of splitting a string based on delimiters.
valueWords, valuePhrase, and Split function:
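(The original VBA functions aren't included in this copy either. The following is a rough Python sketch of the two metrics as described above — valuePhrase as the plain whole-phrase Levenshtein distance, and valueWords as the sum, over each word, of the cheapest edit distance to any word in the other phrase. It assumes the levenshtein() function sketched above, and the delimiter set is my own choice.)

import re

def value_phrase(a, b):
    # Whole-phrase edit distance.
    return levenshtein(a, b)

def split_words(s, delimiters=r"[ \-_]+"):
    # Split on spaces, dashes, underscores (or whatever delimiters you like).
    return [w for w in re.split(delimiters, s) if w]

def value_words(a, b):
    # For each word in one phrase, find the cheapest edit distance to any word
    # in the other phrase, and sum those minima: a word-order-insensitive score.
    words_a, words_b = split_words(a), split_words(b)
    if not words_a or not words_b:
        return value_phrase(a, b)
    return sum(min(levenshtein(wa, wb) for wb in words_b) for wa in words_a)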
Measures of Similarity
Using these two metrics, and a third which simply computes the distance between two strings, I have a series of variables over which I can run an optimization algorithm to achieve the greatest number of matches. Fuzzy string matching is, itself, a fuzzy science, and so by creating linearly independent metrics for measuring string similarity, and having a known set of strings we wish to match to each other, we can find the parameters that, for our specific styles of strings, give the best fuzzy match results.
Initially, the goal of the metric was to have a low search value for an exact match, and increasing search values for increasingly permuted measures. In an impractical case, this was fairly easy to define using a set of well defined permutations, and engineering the final formula so that they had increasing search value results as desired.
In the above screenshot, I tweaked my heuristic to come up with something that I felt scaled nicely to my perceived difference between the search term and the result. The heuristic I used for Value Phrase in the above spreadsheet was =valuePhrase(A2,B2)-0.8*ABS(LEN(B2)-LEN(A2)). I was effectively reducing the penalty of the Levenshtein distance by 80% of the difference in the length of the two "phrases". This way, "phrases" that have the same length suffer the full penalty, but "phrases" which contain 'additional information' (longer) but aside from that still mostly share the same characters suffer a reduced penalty. I used the Value Words function as is, and then my final SearchVal heuristic was defined as =MIN(D2,E2)*0.8+MAX(D2,E2)*0.2 - a weighted average. Whichever of the two scores was lower got weighted at 80%, and the higher score at 20%. This was just a heuristic that suited my use case to get a good match rate. These weights are something that one could then tweak to get the best match rate with their test data.
As you can see, the last two metrics, which are fuzzy string matching metrics, already have a natural tendency to give low scores to strings that are meant to match (down the diagonal). This is very good.
Application
To allow the optimization of fuzzy matching, I weight each metric. As such, every application of fuzzy string matching can weight the parameters differently. The formula that defines the final score is a simple combination of the metrics and their weights:
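(The original formula isn't shown in this copy. Based purely on the five weights discussed below — length, phrase, word, min, and max — a plausible reconstruction might look roughly like the sketch that follows; the exact form in the original article may well differ.)

def search_val(phrase_score, word_score, len_a, len_b,
               w_len=-0.3, w_phrase=0.5, w_word=1.0, w_min=10.0, w_max=1.0):
    # Hypothetical reconstruction: weight each metric, then combine the better
    # and worse of the two weighted scores with the min/max weights, plus a
    # length-difference term. Lower scores mean better matches.
    p = w_phrase * phrase_score
    w = w_word * word_score
    return w_min * min(p, w) + w_max * max(p, w) + w_len * abs(len_a - len_b)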
Using an optimization algorithm (a neural network is best here because it is a discrete, multi-dimensional problem), the goal is now to maximize the number of matches. I created a function that detects the number of correct matches of each set to each other, as can be seen in this final screenshot. A column or row gets a point if the lowest score is assigned to the string that was meant to be matched, and partial points are given if there is a tie for the lowest score and the correct match is among the tied matched strings. I then optimized it. You can see that a green cell is the column that best matches the current row, and a blue square around the cell is the row that best matches the current column. The score in the bottom corner is roughly the number of successful matches, and this is what we tell our optimization problem to maximize.
The algorithm was a wonderful success, and the solution parameters say a lot about this type of problem. You'll notice the optimized score was 44, and the best possible score is 48. The 5 columns at the end are decoys, and do not have any match at all to the row values. The more decoys there are, the harder it will naturally be to find the best match.
In this particular matching case, the length of the strings is irrelevant, because we are expecting abbreviations that represent longer words, so the optimal weight for length is -0.3, which means we do not penalize strings which vary in length. We reduce the score in anticipation of these abbreviations, giving more room for partial word matches to supersede non-word matches that simply require fewer substitutions because the string is shorter.
The word weight is 1.0 while the phrase weight is only 0.5, which means that we penalize whole words missing from one string and place more value on the entire phrase being intact. This is useful because a lot of these strings have one word in common (the peril) where what really matters is whether or not the combination (region and peril) is maintained.
Finally, the min weight is optimized at 10 and the max weight at 1. What this means is that if the best of the two scores (value phrase and value words) isn't very good, the match is greatly penalized, but we don't greatly penalize the worst of the two scores. Essentially, this puts emphasis on requiring either the valueWord or valuePhrase to have a good score, but not both. A sort of "take what we can get" mentality.
It's really fascinating what the optimized value of these 5 weights say about the sort of fuzzy string matching taking place. For completely different practical cases of fuzzy string matching, these parameters are very different. I've used it for 3 separate applications so far.
While unused in the final optimization, a benchmarking sheet was established which matches columns to themselves for all perfect results down the diagonal, and lets the user change parameters to control the rate at which scores diverge from 0, and to note innate similarities between search phrases (which could in theory be used to offset false positives in the results).
Further Applications
This solution has potential to be used anywhere where the user wishes to have a computer system identify a string in a set of strings where there is no perfect match. (Like an approximate match vlookup for strings).
So what you should take from this, is that you probably want to use a combination of high level heuristics (finding words from one phrase in the other phrase, length of both phrases, etc) along with the implementation of the Levenshtein distance algorithm. Because deciding which is the "best" match is a heuristic (fuzzy) determination - you'll have to come up with a set of weights for any metrics you come up with to determine similarity.
With the appropriate set of heuristics and weights, you'll have your comparison program quickly making the decisions that you would have made.
This problem turns up all the time in bioinformatics. The accepted answer above (which was great by the way) is known in bioinformatics as the Needleman-Wunsch (compare two strings) and Smith-Waterman (find an approximate substring in a longer string) algorithms. They work great and have been workhorses for decades.
But what if you have a million strings to compare? That's a trillion pairwise comparisons, each of which is O(n*m)! Modern DNA sequencers easily generate a billion short DNA sequences, each about 200 DNA "letters" long. Typically, we want to find, for each such string, the best match against the human genome (3 billion letters). Clearly, the Needleman-Wunsch algorithm and its relatives will not do.
This so-called "alignment problem" is a field of active research. The most popular algorithms are currently able to find inexact matches between 1 billion short strings and the human genome in a matter of hours on reasonable hardware (say, eight cores and 32 GB RAM).
Most of these algorithms work by quickly finding short exact matches (seeds) and then extending these to the full string using a slower algorithm (for example, the Smith-Waterman). The reason this works is that we are really only interested in a few close matches, so it pays off to get rid of the 99.9...% of pairs that have nothing in common.
How does finding exact matches help finding inexact matches? Well, say we allow only a single difference between the query and the target. It is easy to see that this difference must occur in either the right or left half of the query, and so the other half must match exactly. This idea can be extended to multiple mismatches and is the basis for the ELAND algorithm commonly used with Illumina DNA sequencers.
There are many very good algorithms for doing exact string matching. Given a query string of length 200, and a target string of length 3 billion (the human genome), we want to find any place in the target where there is a substring of length k that matches a substring of the query exactly. A simple approach is to begin by indexing the target: take all k-long substrings, put them in an array and sort them. Then take each k-long substring of the query and search the sorted index.
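(Not from the original answer: a toy Python version of that k-mer indexing idea, using a sorted list of (k-mer, position) pairs and binary search, might look like this.)

import bisect

def build_index(target, k):
    # Every k-long substring of the target, paired with its position, sorted.
    index = sorted((target[i:i + k], i) for i in range(len(target) - k + 1))
    keys = [kmer for kmer, _ in index]
    return index, keys

def find_seed_hits(query, target_index, k):
    index, keys = target_index
    hits = []
    for j in range(len(query) - k + 1):
        kmer = query[j:j + k]
        lo = bisect.bisect_left(keys, kmer)   # binary search in the sorted index
        while lo < len(keys) and keys[lo] == kmer:
            hits.append((j, index[lo][1]))    # (position in query, position in target)
            lo += 1
    return hits

# Example: exact 4-mer seed matches, which would then be extended
# with a slower alignment such as Smith-Waterman.
idx = build_index("ACGTACGTGACCA", k=4)
print(find_seed_hits("TACGTG", idx, k=4))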
Sorting and searching can be done efficiently; the search itself takes O(log n) time per query substring. But storage can be a problem. An index of the 3 billion letter target would need to hold 3 billion pointers and 3 billion k-long words. It would seem hard to fit this in less than several tens of gigabytes of RAM. But amazingly, we can greatly compress the index using the Burrows-Wheeler transform, and it will still be efficiently queryable. An index of the human genome can fit in less than 4 GB of RAM. This idea is the basis of popular sequence aligners such as Bowtie and BWA.
Alternatively, we can use a suffix array, which stores only the pointers, yet represents a simultaneous index of all suffixes in the target string (essentially, a simultaneous index for all possible values of k; the same is true of the Burrows-Wheeler transform). A suffix array index of the human genome will take 12 GB of RAM if we use 32-bit pointers.
The links above contain a wealth of information and links to primary research papers. The ELAND link goes to a PDF with useful figures illustrating the concepts involved, and shows how to deal with insertions and deletions.
Finally, while these algorithms have basically solved the problem of (re)sequencing single human genomes (a billion short strings), DNA sequencing technology improves even faster than Moore's law, and we are fast approaching trillion-letter datasets. For example, there are currently projects underway to sequence the genomes of 10,000 vertebrate species, each a billion letters long or so. Naturally, we will want to do pairwise inexact string matching on the data...
I contest that choice B is closer to the test string, as it's only 4 characters (and 2 deletes) away from being the original string. Whereas you see C as closer because it includes both brown and red. It would, however, have a greater edit distance.
There is an algorithm called Levenshtein Distance which measures the edit distance between two inputs.
Here is a tool for that algorithm.
EDIT: Sorry, I kept mixing up the strings in the Levenshtein tool. Updated to the correct answers.
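(For what it's worth, you can check the distances yourself with the python-Levenshtein package mentioned in another answer; the strings below are just the ones from the question.)

import Levenshtein  # pip install python-Levenshtein

test = "THE BROWN FOX JUMPED OVER THE RED COW"
choices = {
    "A": "THE RED COW JUMPED OVER THE GREEN CHICKEN",
    "B": "THE RED COW JUMPED OVER THE RED COW",
    "C": "THE RED FOX JUMPED OVER THE BROWN COW",
}
for name, s in choices.items():
    # Prints the raw edit distance of each choice from the test string.
    print(name, Levenshtein.distance(test, s))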
Lua implementation, for posterity:
You might find this library helpful!
http://code.google.com/p/google-diff-match-patch/
It is currently available in Java, JavaScript, Dart, C++, C#, Objective C, Lua and Python
It works pretty well too. I use it in a couple of my Lua projects.
And I don't think it would be too difficult to port it to other languages!
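(A hedged illustration of my own, not from this answer: the Python port of the library exposes fuzzy location matching via match_main, in addition to diffs.)

from diff_match_patch import diff_match_patch  # pip install diff-match-patch

dmp = diff_match_patch()
dmp.Match_Threshold = 0.5  # 0.0 = exact matches only, 1.0 = very loose matching

text = "THE BROWN FOX JUMPED OVER THE RED COW"
# match_main returns the index of the closest fuzzy match of the pattern
# near the given expected location, or -1 if nothing is close enough.
print(dmp.match_main(text, "BROWN FOX", 4))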
You might be interested in this blog post.
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
Fuzzywuzzy is a Python library that provides easy distance measures such as Levenshtein distance for string matching. It is built on top of difflib in the standard library and will make use of the C implementation Python-levenshtein if available.
http://pypi.python.org/pypi/python-Levenshtein/
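(For example, using the question's strings — this snippet is mine, not from the blog post — process.extractOne returns the best-scoring candidate directly.)

from fuzzywuzzy import fuzz, process  # pip install fuzzywuzzy

test = "THE BROWN FOX JUMPED OVER THE RED COW"
choices = [
    "THE RED COW JUMPED OVER THE GREEN CHICKEN",
    "THE RED COW JUMPED OVER THE RED COW",
    "THE RED FOX JUMPED OVER THE BROWN COW",
]
print(fuzz.ratio(test, choices[2]))       # similarity score 0-100 for one pair
print(process.extractOne(test, choices))  # (best matching string, its score)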
If you're doing this in the context of a search engine or frontend against a database, you might consider using a tool like Apache Solr, with the ComplexPhraseQueryParser plugin. This combination allows you to search against an index of strings with the results sorted by relevance, as determined by Levenshtein distance.
We've been using it against a large collection of artists and song titles when the incoming query may have one or more typos, and it's worked pretty well (and remarkably fast considering the collections are in the millions of strings).
Additionally, with Solr, you can search against the index on demand via JSON, so you won't have to reinvent the solution between the different languages you're looking at.
The problem is hard to implement if the input data is too large (say, millions of strings). I used Elasticsearch to solve this.
Quick start: https://www.elastic.co/guide/en/elasticsearch/client/net-api/6.x/elasticsearch-net.html
Just insert all the input data into the DB, and you can quickly search any string based on any edit distance. Here is a C# snippet which will give you a list of results sorted by edit distance (smallest to largest).
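(The original C# snippet isn't included in this copy. As a rough sketch of the same idea using the official Python client instead — the index and field names below are made up for illustration, and the 8.x client API is assumed — a fuzzy match query lets Elasticsearch tolerate a bounded per-term edit distance.)

from elasticsearch import Elasticsearch  # assumes the 8.x Python client

es = Elasticsearch("http://localhost:9200")

# "fuzziness" bounds the allowed edit distance per term; results come back
# ranked by relevance score, with closer strings scoring higher.
resp = es.search(
    index="phrases",
    query={"match": {"text": {"query": "THE BROWN FOX JUMPED OVER THE RED COW",
                              "fuzziness": "AUTO"}}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])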
A very, very good resource for these kinds of algorithms is Simmetrics: http://sourceforge.net/projects/simmetrics/
Unfortunately the awesome website containing a lot of the documentation is gone :(
In case it comes back up again, its previous address was this:
http://www.dcs.shef.ac.uk/~sam/simmetrics.html
Voila (courtesy of "Wayback Machine"): http://web.archive.org/web/20081230184321/http://www.dcs.shef.ac.uk/~sam/simmetrics.html
You can study the source code; there are dozens of algorithms for these kinds of comparisons, each with a different trade-off. The implementations are in Java.
To query a large set of text in an efficient manner, you can use the concept of Edit Distance / Prefix Edit Distance.
But computing the ED between every term and the query text is resource- and time-intensive. Therefore, instead of calculating the ED for each term first, we can extract possible matching terms using a technique called a Qgram index, and then apply the ED calculation to those selected terms.
An advantage of the Qgram index technique is that it supports fuzzy search.
One possible approach to adapting a QGram index is to build an inverted index using Qgrams. There we store all the words which contain a particular Qgram, under that Qgram. (Instead of storing the full string you can use a unique ID for each string.) You can use the TreeMap data structure in Java for this.
The following is a small example of storing terms:
Then, when querying, we calculate the number of common Qgrams between the query text and the available terms.
Number of q-grams in common = 4.
For the terms with a high number of common Qgrams, we calculate the ED/PED against the query term and then suggest the term to the end user.
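(A small Python sketch of that idea, as my own illustration — the original uses a Java TreeMap, and the missing storage example above is not reconstructed here.)

from collections import defaultdict

def qgrams(word, q=3):
    # Pad the word so prefixes and suffixes also produce q-grams.
    padded = "$" * (q - 1) + word.lower() + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def build_inverted_index(terms):
    # Inverted index: q-gram -> set of term IDs containing that q-gram.
    index = defaultdict(set)
    for term_id, term in enumerate(terms):
        for g in qgrams(term):
            index[g].add(term_id)
    return index

def candidates(query, terms, index, min_common=2):
    # Count how many of the query's q-grams each term contains; only candidates
    # above the threshold would then be scored with the (more expensive) ED/PED.
    counts = defaultdict(int)
    for g in qgrams(query):
        for term_id in index.get(g, ()):
            counts[term_id] += 1
    return [(terms[t], c) for t, c in counts.items() if c >= min_common]

terms = ["frankfurt", "freiburg", "fribourg"]
index = build_inverted_index(terms)
print(candidates("freiburg", terms, index))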
You can find an implementation of this theory in the following project (see "QGramIndex.java"). Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search
To learn more about edit distance, prefix edit distance, and Qgram indexes, please watch the following video by Prof. Dr. Hannah Bast: https://www.youtube.com/embed/6pUg2wmGJRo (the lesson starts at 20:06)
Here you can have a Go POC for calculating the distances between the given words. You can tune minDistance and difference for other scopes.
Playground: https://play.golang.org/p/NtrBzLdC3rE
A sample using C# is here.
The output is:
There is one more similarity measure which I once implemented in our system, and it gave satisfactory results:
Use Case
There is a user query which needs to be matched against a set of documents.
Algorithm
For every keyword extracted from the user query:
In essence, if the first keyword appears 4 times in the document, the score will be calculated as:
Total similarity score = 1 + 1/2 + 1/3 + 1/4 = 2.083
Similarly, we calculate it for other keywords in user query.
Finally, the total score will represent the extent of similarity between user query and given document.
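(A minimal sketch of that scoring, based on my reading of the description above — the exact per-keyword steps were not included in this copy.)

def keyword_score(keyword, document_words):
    # Each successive occurrence of the keyword contributes 1/1, 1/2, 1/3, ...
    # so repeated hits help, but with diminishing returns.
    occurrences = sum(1 for w in document_words if w == keyword)
    return sum(1.0 / i for i in range(1, occurrences + 1))

def similarity(query, document):
    doc_words = document.lower().split()
    return sum(keyword_score(k, doc_words) for k in query.lower().split())

# e.g. a keyword appearing 4 times scores 1 + 1/2 + 1/3 + 1/4 ≈ 2.083
print(similarity("red cow", "the red cow jumped over the red cow"))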
Here is a quick solution that doesn't depend on any libraries, and works well enough for things like autocomplete forms:
It can be used in an autocomplete input like this:
HTML:
CSS:
JS:
Result: