优化近似重复值搜索

发布于 2024-09-02 11:59:36 字数 2950 浏览 5 评论 0原文

我试图在一组字段中查找接近重复的值,以便管理员可以清理它们。

我匹配的有两个标准

  1. 一个字符串完全包含在另一个字符串中,并且至少是其长度的 1/4
  2. 字符串的编辑距离小于两个字符串总长度的 5%

伪 PHP code:

foreach($values as $value){
$matches = array();
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
// output matches for current outer loop value
}

我尝试尽可能减少对相对昂贵的 striposlevenshtein 函数的调用,这大大减少了执行时间。 然而,作为 O(n^2) 操作,这不能扩展到更大的值集,并且似乎大量的处理时间花费在简单地迭代数组上。

正在操作的几组值的一些属性

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

是否还有其他事情可以减少检查条件的时间,更重要的是有什么方法可以减少所需的条件检查的数量(例如,通过预检查) -处理输入值),因为选择性如此低?


编辑:已实施的解决方案

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
  if(
    (
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($changes = levenshtein($value['value'], $match['value']))
      &&
      $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }

}
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}

I'm trying to find near duplicate values in a set of fields in order to allow an administrator to clean them up.

There are two criteria that I am matching on

  1. One string is wholly contained within the other, and is at least 1/4 of its length
  2. The strings have an edit distance less than 5% of the total length of the two strings

The Pseudo-PHP code:

foreach($values as $value){
$matches = array();
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
// output matches for current outer loop value
}

I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit.
However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays.

Some properties of a few sets of values being operated on

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?


Edit: Implemented solution

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
  if(
    (
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($changes = levenshtein($value['value'], $match['value']))
      &&
      $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }

}
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

安静 2024-09-09 11:59:37

您可以首先按长度对字符串进行排序 ( O(N) ),然后仅检查较小的字符串是子字符串还是较大的字符串,并且仅检查差异不太大的字符串对中的 levenshtein。

您已经执行了这些检查,但现在您对所有 N x N 对执行此操作,同时按长度预先选择第一个将帮助您减少首先检查的对。避免 N x N 循环,即使它只包含会失败的测试。

对于子字符串匹配,您可以通过为所有较小的项目创建索引来进一步改进,并在解析较大的项目时相应地更新它。索引应该可以形成以字母为分支的树结构,其中每个单词(字符串)形成从根到叶的路径。这样您就可以找到索引中的任何单词是否与某个字符串进行比较以匹配。对于匹配字符串中的每个字符,尝试继续树索引中的任何指针,并在索引处创建一个新指针。如果指针无法前进到索引中的下一个字符,则将其删除。如果任何指针到达叶注释,则表明您找到了子字符串匹配。
我认为实现这一点并不困难,但也并非微不足道。

You could first order the strings by length ( O(N) ) and then only check smaller strings to be substrings or larger strings, plus only check with levenshtein in string pairs for which the difference is not too large.

You already perform these checks, but now you do it for all N x N pairs, while preselecting first by length will help you reduce the pairs to check first. Avoid the N x N loop, even if it contains only tests that will fail.

For substring matching you could further improve by creating an index for all smaller items, and update this accordingly as you parse larger items. The index should can form a tree structure branching on letters, where each word (string) forms a path from root to leaf. This way you can find if any of the words in the index compare to some string to match. For each character in your match string try to proceed any pointers in the tree index, and create a new pointer at the index. If a pointer can not be proceeded to a following character in the index, you remove it. If any pointer reaches a leaf note, you've found a substring match.
Implementing this is, I think, not difficult, but not trivial either.

随波逐流 2024-09-09 11:59:37

通过收紧内环,你可以立即获得 100% 的进步。您的结果中没有出现重复的匹配项吗?

对于预处理步骤,我将执行并计算字符频率(假设您的字符集很小,例如 a-z0-9,鉴于您使用的是 stripos,我认为这是可能的)。然后不是比较序列(昂贵)而是比较频率(便宜)。这会给您带来误报,您可以忍受这些误报,也可以将其插入到您当前必须清除的测试中。

You can get an instant 100% improvement by tightening your inner loop. Aren't you getting duplicate matches in your results?

For a preprocess step I'd go through and calculate character frequencies (assuming your set of characters is small like a-z0-9, which, given that you're using stripos, I think is likely). Then rather than comparing sequences (expensive) compare frequencies (cheap). This will give you false positives which you can either live with, or plug into the test you've currently got to weed out.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文