优化近似重复值搜索

发布于 2024-09-02 11:59:36 字数 2950 浏览 5 评论 0原文

我试图在一组字段中查找接近重复的值，以便管理员可以清理它们。

我匹配的有两个标准

一个字符串完全包含在另一个字符串中，并且至少是其长度的 1/4
字符串的编辑距离小于两个字符串总长度的 5%

伪 PHP code：

foreach($values as $value){
$matches = array();
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
// output matches for current outer loop value
}

我尝试尽可能减少对相对昂贵的 stripos 和 levenshtein 函数的调用，这大大减少了执行时间。然而，作为 O(n^2) 操作，这不能扩展到更大的值集，并且似乎大量的处理时间花费在简单地迭代数组上。

正在操作的几组值的一些属性

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

是否还有其他事情可以减少检查条件的时间，更重要的是有什么方法可以减少所需的条件检查的数量（例如，通过预检查） -处理输入值），因为选择性如此低？

编辑：已实施的解决方案

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
  if(
    (
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($changes = levenshtein($value['value'], $match['value']))
      &&
      $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }

}
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}

原文

I'm trying to find near duplicate values in a set of fields in order to allow an administrator to clean them up.

There are two criteria that I am matching on

One string is wholly contained within the other, and is at least 1/4 of its length
The strings have an edit distance less than 5% of the total length of the two strings

The Pseudo-PHP code:

foreach($values as $value){
$matches = array();
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
// output matches for current outer loop value
}

I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit.
However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays.

Some properties of a few sets of values being operated on

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?

Edit: Implemented solution

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
  if(
    (
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($changes = levenshtein($value['value'], $match['value']))
      &&
      $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }

}
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

安静 2024-09-09 11:59:37

您可以首先按长度对字符串进行排序 ( O(N) )，然后仅检查较小的字符串是子字符串还是较大的字符串，并且仅检查差异不太大的字符串对中的 levenshtein。

您已经执行了这些检查，但现在您对所有 N x N 对执行此操作，同时按长度预先选择第一个将帮助您减少首先检查的对。避免 N x N 循环，即使它只包含会失败的测试。

对于子字符串匹配，您可以通过为所有较小的项目创建索引来进一步改进，并在解析较大的项目时相应地更新它。索引应该可以形成以字母为分支的树结构，其中每个单词（字符串）形成从根到叶的路径。这样您就可以找到索引中的任何单词是否与某个字符串进行比较以匹配。对于匹配字符串中的每个字符，尝试继续树索引中的任何指针，并在索引处创建一个新指针。如果指针无法前进到索引中的下一个字符，则将其删除。如果任何指针到达叶注释，则表明您找到了子字符串匹配。
我认为实现这一点并不困难，但也并非微不足道。

回复收藏 0 原文