使用 PHP Levenshtein 比较 5000 个字符串
我的数组中有 5000 个(有时更多)街道地址字符串。我想将它们与 levenshtein 进行比较,以找到类似的匹配项。我怎样才能做到这一点而不循环遍历所有 5000 个并将它们直接与其他 4999 个进行比较?
编辑:如果有人有建议,我也对替代方法感兴趣。总体目标是根据用户提交的街道地址找到相似的条目(并消除重复项)。
I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?
Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
我认为对相似地址进行分组的更好方法是:
创建一个包含两个表的数据库 - 一个用于地址(和一个 ID),一个用于地址中单词或文字数字的发音(带有外键)地址表的)
大写地址,用空格替换 [AZ] 或 [0-9] 以外的任何内容
分割按空格分割地址,计算每个“单词”的 soundex,保留任何仅包含数字的内容,并将其与您开始的地址的外键一起存储在 soundexes 表中
为每个地址(带有 id $target)找到最相似的地址:
计算源地址与查询返回的前几个值之间的levenstein差异。
(在数据库中对大型数组执行任何类型的操作通常更快)
I think a better way to group similar addresses would be to:
create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)
uppercase the address, replace anything other than [A-Z] or [0-9] with a space
split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with
for each address (with id $target) find the most similar addresses:
calculate the levenstein difference between your source address and the top few values returned by the query.
(doing any sort of operation on large arrays is often faster in databases)
我认为您无法避免循环遍历数组,因为 levenstein() 函数仅接受字符串而不是数组作为输入。
你可以这样做:
I think you cannot avoid looping through the array as the levenstein() function takes only strings and not an array as input.
You can do something like:
您可以使用 bk-tree 来加快搜索/比较速度。
http://blog.notdot.net/ 2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees 说:
编辑:但这并不能帮助您解决(“12 Bird Road, Apt 6”和“12 Bird Rd.#6”)问题。仅与您的强力 m*n 比较。
You can use a bk-tree to speed-up the search/comparison.
http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:
edit: But that doesn't help you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Only with your brute-force m*n comparison.
由于 Levenshtein 算法的性质(特别是它是两个字符串之间的比较),我不明白这是怎么可能的。
您当然可以通过首先执行一些基本匹配要求来减少比较次数,但这超出了您要求的范围。
作为一个(很可能不相关的)选项,您始终可以使用诸如
soundex
之类的东西,它可以让您预先计算字符串值。 (我相信你也可以直接在 MySQL 中使用它。)Due to the nature of the Levenshtein algorithm (specifically the fact that it's a comparision between two strings), I can't see how this is possible.
You could of course reduce the number of comparisons by performing some basic matching requirements first, but this is out of the scope of what you're asking.
As a (quite possibly irrelevant) option, you could always use something like
soundex
which would let you pre-compute the string values. (You can also use it directly in MySQL I believe.)您可以根据 soundexes 对它们进行分组,然后将比较限制为最近的 N 个案例...
然后迭代 $mashed 的键。
C.
You could group them based on soundexes then limit the comparisons to the nearest N cases...
Then iterate through the keys of $mashed.
C.
如果您想找到所有相似值,则必须将所有项目与所有其他项目进行比较。但选择正确的数组函数会显着加快速度。这是一个简单的例子(结果数组本来可以更好):
If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):
考虑到您的问题,如果您想使用 Lehvenstein,我认为除了将每个地址与其他地址进行比较之外别无选择距离。
首先,您应该规范地址,去掉缩写等。
对于相似的地址,您可以有一些固定的最大 Lehvenstein 距离 (N)。
如果是这样,当您确定当前地址对的编辑距离大于 N 时,您可以中止 Lehvenstein 算法。为此,您需要编写 Lehvenstein 算法的自定义版本。这将使算法更快一些。
还有一些相关的琐碎优化。例如:如果地址 A 的长度为 10 个字符,地址 B 的长度为 20 个字符,并且您认为 Lehvenstein 距离小于 8 的地址是相似的。您可以查看地址的长度并立即判断它们不相似。
Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.
First of all, you should normalize addressess, get rid of abbreviations etc.
You could have some fixed max Lehvenstein distance ( N ) for similar addresses.
If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.
There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.