使用 PHP Levenshtein 比较 5000 个字符串

发布于 2024-08-15 12:29:31 字数 195 浏览 20 评论 0原文

我的数组中有 5000 个(有时更多)街道地址字符串。我想将它们与 levenshtein 进行比较,以找到类似的匹配项。我怎样才能做到这一点而不循环遍历所有 5000 个并将它们直接与其他 4999 个进行比较?

编辑:如果有人有建议,我也对替代方法感兴趣。总体目标是根据用户提交的街道地址找到相似的条目(并消除重复项)。

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

人心善变 2024-08-22 12:29:31

我认为对相似地址进行分组的更好方法是:

  1. 创建一个包含两个表的数据库 - 一个用于地址(和一个 ID),一个用于地址中单词或文字数字的发音(带有外键)地址表的)

  2. 大写地址,用空格替换 [AZ] 或 [0-9] 以外的任何内容

  3. 分割按空格分割地址,计算每个“单词”的 soundex,保留任何仅包含数字的内容,并将其与您开始的地址的外键一起存储在 soundexes 表中

  4. 为每个地址(带有 id $target)找到最相似的地址:

    SELECT 相似.id、相似.地址、计数(*) 
    FROM 地址相似,字 cmp,字 src
    WHERE src.address_id=$target
    AND src.soundex=cmp.soundex
    AND cmp.address_id=similar.id
    按计数排序(*)
    限制$some_value;
    
  5. 计算源地址与查询返回的前几个值之间的levenstein差异。

(在数据库中对大型数组执行任何类型的操作通常更快)

I think a better way to group similar addresses would be to:

  1. create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)

  2. uppercase the address, replace anything other than [A-Z] or [0-9] with a space

  3. split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

  4. for each address (with id $target) find the most similar addresses:

    SELECT similar.id, similar.address, count(*) 
    FROM adress similar, word cmp, word src
    WHERE src.address_id=$target
    AND src.soundex=cmp.soundex
    AND cmp.address_id=similar.id
    ORDER BY count(*)
    LIMIT $some_value;
    
  5. calculate the levenstein difference between your source address and the top few values returned by the query.

(doing any sort of operation on large arrays is often faster in databases)

听闻余生 2024-08-22 12:29:31

我认为您无法避免循环遍历数组,因为 levenstein() 函数仅接受字符串而不是数组作为输入。

你可以这样做:

for($i=0;$i<count($array)-1;$i++)
{
    for($j=$i+1;$j<count($array);$j++)
    {
        $lev = levenshtein($array[$i],$array[$j]);
        if($lev == 0)
        {
            // exact match
        }
        else if($lev <= THRESHOLD)
        {
            // similar
        }
    }
}

I think you cannot avoid looping through the array as the levenstein() function takes only strings and not an array as input.

You can do something like:

for($i=0;$i<count($array)-1;$i++)
{
    for($j=$i+1;$j<count($array);$j++)
    {
        $lev = levenshtein($array[$i],$array[$j]);
        if($lev == 0)
        {
            // exact match
        }
        else if($lev <= THRESHOLD)
        {
            // similar
        }
    }
}
拍不死你 2024-08-22 12:29:31

您可以使用 bk-tree 来加快搜索/比较速度。

http://blog.notdot.net/ 2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees 说:

现在我们可以对 Levenshtein 距离进行一个特别有用的观察:它形成了一个度量空间。
[...]
假设我们有两个参数,query,我们在搜索中使用的字符串,以及 n 字符串与查询之间的最大距离,但仍会返回。假设我们采用任意字符串,测试并将其与查询进行比较。将所得距离称为 d。因为我们知道三角不等式成立,所以我们所有的结果与测试的距离必须最多为 d+n,且至少为 dn。
[...]
测试表明,使用距离为 1 的搜索查询不超过树的 5-8%,使用两个错误搜索的查询不超过树的 17-25% - 与检查每个节点相比有了很大的改进!

编辑:但这并不能帮助您解决(“12 Bird Road, Apt 6”和“12 Bird Rd.#6”)问题。仅与您的强力 m*n 比较。

You can use a bk-tree to speed-up the search/comparison.

http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:

Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space.
[...]
Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
[...]
Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node!

edit: But that doesn't help you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Only with your brute-force m*n comparison.

醉态萌生 2024-08-22 12:29:31

由于 Levenshtein 算法的性质(特别是它是两个字符串之间的比较),我不明白这是怎么可能的。

您当然可以通过首先执行一些基本匹配要求来减少比较次数,但这超出了您要求的范围。

作为一个(很可能不相关的)选项,您始终可以使用诸如 soundex 之类的东西,它可以让您预先计算字符串值。 (我相信你也可以直接在 MySQL 中使用它。)

Due to the nature of the Levenshtein algorithm (specifically the fact that it's a comparision between two strings), I can't see how this is possible.

You could of course reduce the number of comparisons by performing some basic matching requirements first, but this is out of the scope of what you're asking.

As a (quite possibly irrelevant) option, you could always use something like soundex which would let you pre-compute the string values. (You can also use it directly in MySQL I believe.)

无名指的心愿 2024-08-22 12:29:31

您可以根据 soundexes 对它们进行分组,然后将比较限制为最近的 N 个案例...

 $mashed=array();
 foreach ($address as $key=>$val) {
      $mashed[$key]=soundex($val);
 }
 sort($mashed);

然后迭代 $mashed 的键。

C.

You could group them based on soundexes then limit the comparisons to the nearest N cases...

 $mashed=array();
 foreach ($address as $key=>$val) {
      $mashed[$key]=soundex($val);
 }
 sort($mashed);

Then iterate through the keys of $mashed.

C.

纸短情长 2024-08-22 12:29:31

如果您想找到所有相似值,则必须将所有项目与所有其他项目进行比较。但选择正确的数组函数会显着加快速度。这是一个简单的例子(结果数组本来可以更好):

$results = array();
$count = count($entries);
while ($count != 0) {
    # The entry to process
    $entry = array_shift($entries);
    # Get levenshtein distances to all others
    $result = array_map(
        'levenshtein',
        # array_map() needs two arrays, this one is an array consisting of
        # multiple entries of the value that we are processing
        array_fill($entry, 0, $count),
        $toCompare
    );
    $results[] = array($entry => $result);
    $count--;
}

If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):

$results = array();
$count = count($entries);
while ($count != 0) {
    # The entry to process
    $entry = array_shift($entries);
    # Get levenshtein distances to all others
    $result = array_map(
        'levenshtein',
        # array_map() needs two arrays, this one is an array consisting of
        # multiple entries of the value that we are processing
        array_fill($entry, 0, $count),
        $toCompare
    );
    $results[] = array($entry => $result);
    $count--;
}
梦罢 2024-08-22 12:29:31

考虑到您的问题,如果您想使用 Lehvenstein,我认为除了将每个地址与其他地址进行比较之外别无选择距离。

首先,您应该规范地址,去掉缩写等。

  • Ave ->大道路
  • 。 ->道路

对于相似的地址,您可以有一些固定的最大 Lehvenstein 距离 (N)。

如果是这样,当您确定当前地址对的编辑距离大于 N 时,您可以中止 Lehvenstein 算法。为此,您需要编写 Lehvenstein 算法的自定义版本。这将使算法更快一些。

还有一些相关的琐碎优化。例如:如果地址 A 的长度为 10 个字符,地址 B 的长度为 20 个字符,并且您认为 Lehvenstein 距离小于 8 的地址是相似的。您可以查看地址的长度并立即判断它们不相似。

Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.

First of all, you should normalize addressess, get rid of abbreviations etc.

  • Ave -> Avenue
  • Rd. -> Road

You could have some fixed max Lehvenstein distance ( N ) for similar addresses.

If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.

There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.

尐偏执 2024-08-22 12:29:31
$stringA = "this is php programming language";
$stringB = "this is complete programming script in which java php and  all other minor languages include";

echo "string 1---->".$stringA."<br />";
echo "string 2---->".$stringB."<br />";
// changing string to arrays
$array1 = explode(' ', $stringA);
$array2 = explode(' ', $stringB);

// getting same element from two strings
$c = array_intersect($array1, $array2);
// changing array to the string
$d=implode(' ',$c);

echo "string same elements---> ".$d."<br />";


// getting difrent  element from two arrays
$result = array_diff($array2, $array1);
// changing array to the string
$zem = implode(' ',$result);

if (!empty($zem)) {
  echo "string diffrence---> ".$zem."<br />";
} else {
  echo "string diffrence--->both strings are same <br />";
}

similar_text($stringA, $d, $p);
echo " similarity between the string is ".$p."% <br />";
$stringA = "this is php programming language";
$stringB = "this is complete programming script in which java php and  all other minor languages include";

echo "string 1---->".$stringA."<br />";
echo "string 2---->".$stringB."<br />";
// changing string to arrays
$array1 = explode(' ', $stringA);
$array2 = explode(' ', $stringB);

// getting same element from two strings
$c = array_intersect($array1, $array2);
// changing array to the string
$d=implode(' ',$c);

echo "string same elements---> ".$d."<br />";


// getting difrent  element from two arrays
$result = array_diff($array2, $array1);
// changing array to the string
$zem = implode(' ',$result);

if (!empty($zem)) {
  echo "string diffrence---> ".$zem."<br />";
} else {
  echo "string diffrence--->both strings are same <br />";
}

similar_text($stringA, $d, $p);
echo " similarity between the string is ".$p."% <br />";
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文