How do I create a threshold for similar strings using Levenshtein distance and account for typos?

Posted on 2024-09-11 08:40:37

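We recently ran into an interesting problem at work: we discovered duplicate user-submitted data in our database. We realized that for most of this data, the Levenshtein distance between two duplicates was simply the difference between the two strings in question. In other words, if we just add characters from one string to the other we end up with the same string, and for most cases this seems like the best way for us to account for duplicate items.

We also want to account for typos. So we started thinking about how often, on average, people make typos per word online, and tried to use that figure within this distance. We could not find any such statistic.

Is there any way to account for typos when creating this sort of threshold for matching data?

Let me know if I can clarify anything!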

太阳哥哥 2024-09-18 08:40:37

First off, Levenshtein distance is defined as the minimum number of edits required to transform string A into string B, where an edit is the insertion or deletion of a single character, or the replacement of one character with another. So it's very much the "difference between two strings", for a certain definition of distance. =)
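
To make the definition concrete, here is a minimal sketch of the usual dynamic-programming computation in Python. The function name `levenshtein` is just an illustrative choice, not something from the question; it keeps only two rows of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Each call fills a len(A) by len(B) table, which is where the per-comparison cost comes from.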

It sounds like you're looking for a distance function F(A, B) that gives a distance between strings A and B, plus a threshold N such that strings with a distance of less than N from each other are candidates for typos. In addition to Levenshtein distance, you might also consider Needleman–Wunsch. It's basically the same thing, but it lets you provide a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards, though.
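
As a rough sketch of the keyboard-weight idea: the code below is a weighted edit distance in the spirit of Needleman–Wunsch (written as a cost to minimize rather than a score to maximize). Everything specific in it is an assumption for illustration: the simplified three-row layout that ignores row stagger, the 0.5 cost for neighbouring keys, and the names `substitution_cost` and `weighted_distance`.

```python
# Simplified QWERTY layout; real keyboards stagger the rows, which this ignores.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def substitution_cost(x: str, y: str) -> float:
    """Cheaper substitution when the two keys are neighbours (assumed weights)."""
    if x == y:
        return 0.0
    px, py = KEY_POS.get(x), KEY_POS.get(y)
    if px is not None and py is not None:
        if abs(px[0] - py[0]) <= 1 and abs(px[1] - py[1]) <= 1:
            return 0.5  # assumed weight for a likely slip of the finger
    return 1.0


def weighted_distance(a: str, b: str, gap: float = 1.0) -> float:
    """Edit distance with a pluggable substitution cost and a fixed gap cost."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + gap,                           # delete ca
                current[j - 1] + gap,                        # insert cb
                previous[j - 1] + substitution_cost(ca, cb), # substitute
            ))
        previous = current
    return previous[-1]


print(weighted_distance("cat", "cst"))  # 0.5: 'a' and 's' are neighbours
print(weighted_distance("cat", "cpt"))  # 1.0: 'a' and 'p' are far apart
```

With weights like these, a threshold N can sit between "one neighbouring-key slip" and "one arbitrary substitution", which plain Levenshtein distance cannot distinguish.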

If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:

Apologies if this is obvious, but Levenshtein distance is symmetrical, so make sure you aren't computing F(A, B) and F(B, A).
abs(len(A) - len(B)) is a lower bound on the distance between strings A and B. So you can skip checking strings whose lengths are too different.
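
Both suggestions can be sketched together as follows. `candidate_pairs` is a hypothetical helper name, and the sample strings are made up; `itertools.combinations` visits each unordered pair once, and the length check skips pairs that cannot fall below the threshold.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def candidate_pairs(strings, threshold):
    """Yield (a, b, distance) for unordered pairs with distance < threshold."""
    for a, b in combinations(strings, 2):  # never computes both F(A,B) and F(B,A)
        if abs(len(a) - len(b)) >= threshold:
            continue  # length gap is a lower bound, so the pair cannot qualify
        distance = levenshtein(a, b)
        if distance < threshold:
            yield a, b, distance


names = ["main street", "main stret", "elm avenue", "main streets"]
for a, b, d in candidate_pairs(names, threshold=2):
    print(a, "|", b, "|", d)
```

This still enumerates O(k^2) pairs; it only makes the skipped ones cheap. Grouping strings by length first would avoid even generating pairs whose lengths are too far apart.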

One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those to be identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", etc. That dictionary might get pretty big, but I don't know a better way to deal with this issue.
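
A minimal sketch of that canonicalization step might look like this. The `ABBREVIATIONS` table and the `canonicalize` name are hypothetical, and the table is deliberately tiny; a real one would need many more entries and some care with ambiguous cases.

```python
import re

# Hypothetical abbreviation table; "st" could also mean "saint", for example.
ABBREVIATIONS = {
    "1st": "first",
    "2nd": "second",
    "3rd": "third",
    "st": "street",
    "ave": "avenue",
}


def canonicalize(text: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(canonicalize("1st St."))       # first street
print(canonicalize("First Street"))  # first street
```

The distance function is then applied to the canonical forms rather than the raw input, so the threshold only has to absorb genuine typos.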

Since you tagged this question with php, I'm assuming you want to use PHP for this. PHP has a built-in levenshtein() function, but both strings have to be 255 characters or less (that limit applies to PHP versions before 8.0). If that's not long enough, you'll have to write your own. Alternatively, you could investigate using Python's difflib.
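
If you go the difflib route, a small sketch looks like the following. Note that difflib does not compute Levenshtein distance: `SequenceMatcher.ratio()` returns a similarity between 0 and 1 based on matching blocks, so the threshold becomes a cutoff on that ratio instead of an edit count. The `similarity` wrapper is just an illustrative name.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


print(similarity("abcd", "bcde"))  # 0.75

# get_close_matches returns the best matches whose ratio is at least `cutoff`
# (default 0.6), best first.
print(get_close_matches("appel", ["ape", "apple", "peach", "puppy"]))
# ['apple', 'ape']
```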
