当前位置：文江博客话题详情

使用 PHP Levenshtein 比较 5000 个字符串

发布于 2024-08-15 12:29:31 字数 195 浏览 20 评论 0原文

我的数组中有 5000 个（有时更多）街道地址字符串。我想将它们与 levenshtein 进行比较，以找到类似的匹配项。我怎样才能做到这一点而不循环遍历所有 5000 个并将它们直接与其他 4999 个进行比较？

编辑：如果有人有建议，我也对替代方法感兴趣。总体目标是根据用户提交的街道地址找到相似的条目（并消除重复项）。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

人心善变 2024-08-22 12:29:31

我认为对相似地址进行分组的更好方法是：

创建一个包含两个表的数据库 - 一个用于地址（和一个 ID），一个用于地址中单词或文字数字的发音（带有外键）地址表的）
大写地址，用空格替换 [AZ] 或 [0-9] 以外的任何内容
分割按空格分割地址，计算每个“单词”的 soundex，保留任何仅包含数字的内容，并将其与您开始的地址的外键一起存储在 soundexes 表中

为每个地址（带有 id $target）找到最相似的地址：

SELECT 相似.id、相似.地址、计数(*) 
FROM 地址相似，字 cmp，字 src
WHERE src.address_id=$target
AND src.soundex=cmp.soundex
AND cmp.address_id=similar.id
按计数排序(*)
限制$some_value；

计算源地址与查询返回的前几个值之间的levenstein差异。

（在数据库中对大型数组执行任何类型的操作通常更快）

I think a better way to group similar addresses would be to:

create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)
uppercase the address, replace anything other than [A-Z] or [0-9] with a space
split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

for each address (with id $target) find the most similar addresses:

SELECT similar.id, similar.address, count(*) 
FROM adress similar, word cmp, word src
WHERE src.address_id=$target
AND src.soundex=cmp.soundex
AND cmp.address_id=similar.id
ORDER BY count(*)
LIMIT $some_value;

calculate the levenstein difference between your source address and the top few values returned by the query.

(doing any sort of operation on large arrays is often faster in databases)

回复收藏 0 原文

听闻余生 2024-08-22 12:29:31

我认为您无法避免循环遍历数组，因为 levenstein() 函数仅接受字符串而不是数组作为输入。

你可以这样做：

for($i=0;$i<count($array)-1;$i++)
{
    for($j=$i+1;$j<count($array);$j++)
    {
        $lev = levenshtein($array[$i],$array[$j]);
        if($lev == 0)
        {
            // exact match
        }
        else if($lev <= THRESHOLD)
        {
            // similar
        }
    }
}

I think you cannot avoid looping through the array as the levenstein() function takes only strings and not an array as input.

You can do something like:

for($i=0;$i<count($array)-1;$i++)
{
    for($j=$i+1;$j<count($array);$j++)
    {
        $lev = levenshtein($array[$i],$array[$j]);
        if($lev == 0)
        {
            // exact match
        }
        else if($lev <= THRESHOLD)
        {
            // similar
        }
    }
}

回复收藏 0 原文

拍不死你 2024-08-22 12:29:31

您可以使用 bk-tree 来加快搜索/比较速度。

http://blog.notdot.net/ 2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees 说：

现在我们可以对 Levenshtein 距离进行一个特别有用的观察：它形成了一个度量空间。
[...]
假设我们有两个参数，query，我们在搜索中使用的字符串，以及 n 字符串与查询之间的最大距离，但仍会返回。假设我们采用任意字符串，测试并将其与查询进行比较。将所得距离称为 d。因为我们知道三角不等式成立，所以我们所有的结果与测试的距离必须最多为 d+n，且至少为 dn。
[...]
测试表明，使用距离为 1 的搜索查询不超过树的 5-8%，使用两个错误搜索的查询不超过树的 17-25% - 与检查每个节点相比有了很大的改进！

编辑：但这并不能帮助您解决（“12 Bird Road, Apt 6”和“12 Bird Rd.#6”）问题。仅与您的强力 m*n 比较。

回复收藏 0 原文

醉态萌生 2024-08-22 12:29:31

由于 Levenshtein 算法的性质（特别是它是两个字符串之间的比较），我不明白这是怎么可能的。

您当然可以通过首先执行一些基本匹配要求来减少比较次数，但这超出了您要求的范围。

作为一个（很可能不相关的）选项，您始终可以使用诸如 soundex 之类的东西，它可以让您预先计算字符串值。（我相信你也可以直接在 MySQL 中使用它。）

回复收藏 0 原文

无名指的心愿 2024-08-22 12:29:31

您可以根据 soundexes 对它们进行分组，然后将比较限制为最近的 N 个案例...

 $mashed=array();
 foreach ($address as $key=>$val) {
      $mashed[$key]=soundex($val);
 }
 sort($mashed);

然后迭代 $mashed 的键。

You could group them based on soundexes then limit the comparisons to the nearest N cases...

 $mashed=array();
 foreach ($address as $key=>$val) {
      $mashed[$key]=soundex($val);
 }
 sort($mashed);

Then iterate through the keys of $mashed.

回复收藏 0 原文

纸短情长 2024-08-22 12:29:31

如果您想找到所有相似值，则必须将所有项目与所有其他项目进行比较。但选择正确的数组函数会显着加快速度。这是一个简单的例子（结果数组本来可以更好）：

$results = array();
$count = count($entries);
while ($count != 0) {
    # The entry to process
    $entry = array_shift($entries);
    # Get levenshtein distances to all others
    $result = array_map(
        'levenshtein',
        # array_map() needs two arrays, this one is an array consisting of
        # multiple entries of the value that we are processing
        array_fill($entry, 0, $count),
        $toCompare
    );
    $results[] = array($entry => $result);
    $count--;
}

If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):

$results = array();
$count = count($entries);
while ($count != 0) {
    # The entry to process
    $entry = array_shift($entries);
    # Get levenshtein distances to all others
    $result = array_map(
        'levenshtein',
        # array_map() needs two arrays, this one is an array consisting of
        # multiple entries of the value that we are processing
        array_fill($entry, 0, $count),
        $toCompare
    );
    $results[] = array($entry => $result);
    $count--;
}

回复收藏 0 原文

梦罢 2024-08-22 12:29:31

考虑到您的问题，如果您想使用 Lehvenstein，我认为除了将每个地址与其他地址进行比较之外别无选择距离。

首先，您应该规范地址，去掉缩写等。

Ave ->大道路
。 ->道路

对于相似的地址，您可以有一些固定的最大 Lehvenstein 距离 (N)。

如果是这样，当您确定当前地址对的编辑距离大于 N 时，您可以中止 Lehvenstein 算法。为此，您需要编写 Lehvenstein 算法的自定义版本。这将使算法更快一些。

还有一些相关的琐碎优化。例如：如果地址 A 的长度为 10 个字符，地址 B 的长度为 20 个字符，并且您认为 Lehvenstein 距离小于 8 的地址是相似的。您可以查看地址的长度并立即判断它们不相似。

回复收藏 0 原文

尐偏执 2024-08-22 12:29:31

$stringA = "this is php programming language";
$stringB = "this is complete programming script in which java php and  all other minor languages include";

echo "string 1---->".$stringA."<br />";
echo "string 2---->".$stringB."<br />";
// changing string to arrays
$array1 = explode(' ', $stringA);
$array2 = explode(' ', $stringB);

// getting same element from two strings
$c = array_intersect($array1, $array2);
// changing array to the string
$d=implode(' ',$c);

echo "string same elements---> ".$d."<br />";


// getting difrent  element from two arrays
$result = array_diff($array2, $array1);
// changing array to the string
$zem = implode(' ',$result);

if (!empty($zem)) {
  echo "string diffrence---> ".$zem."<br />";
} else {
  echo "string diffrence--->both strings are same <br />";
}

similar_text($stringA, $d, $p);
echo " similarity between the string is ".$p."% <br />";

$stringA = "this is php programming language";
$stringB = "this is complete programming script in which java php and  all other minor languages include";

echo "string 1---->".$stringA."<br />";
echo "string 2---->".$stringB."<br />";
// changing string to arrays
$array1 = explode(' ', $stringA);
$array2 = explode(' ', $stringB);

// getting same element from two strings
$c = array_intersect($array1, $array2);
// changing array to the string
$d=implode(' ',$c);

echo "string same elements---> ".$d."<br />";


// getting difrent  element from two arrays
$result = array_diff($array2, $array1);
// changing array to the string
$zem = implode(' ',$result);

if (!empty($zem)) {
  echo "string diffrence---> ".$zem."<br />";
} else {
  echo "string diffrence--->both strings are same <br />";
}

similar_text($stringA, $d, $p);
echo " similarity between the string is ".$p."% <br />";

回复收藏 0 原文

~没有更多了~