如何将数组中的行与掩码数组相匹配?

发布于 2024-12-05 19:15:26 字数 316 浏览 4 评论 0原文

我有这样的数组:

array('1224*', '543*', '321*' ...) 其中包含大约 17,00 个“掩码”或前缀。

我有第二个数组:

array('123456789', '123456788', '987654321' ....) 其中包含大约 250,000 个数字。

现在,如何使用掩码/前缀数组有效地匹配第二个数组中的每个数字?

[编辑]

第一个数组仅包含前缀,每个条目末尾只有一个 *

I have array like this:

array('1224*', '543*', '321*' ...) which contains about 17,00 "masks" or prefixes.

I have a second array:

array('123456789', '123456788', '987654321' ....) which contain about 250,000 numbers.

Now, how can I efficiently match every number from the second array using the array of masks/prefixes?

[EDIT]

The first array contains only prefixes and every entry has only one * at the end.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

浸婚纱 2024-12-12 19:15:26

嗯,这是一个解决方案:

初步步骤:

  1. 对数组 1 进行排序,剪掉 * 的部分。

搜索:

  1. 对于数组 2 中的每个数字执行
    1. 查找数组 1 中第一个和最后一个第一个字符与 number 匹配的条目(二分查找)。
    2. 对第二个字符执行相同的操作,这次不是搜索整个数组,而是在 firstlast 之间搜索(二分搜索)。
    3. 对第 n 个字符重复 2,直到找到字符串。

这应该是 O(k*n*log(n)),其中 n 是平均数字长度(以数字为单位),k 是数字数字。


基本上,这是一个一维 Radix 树,为了获得最佳性能,您应该实现它,但它可以是相当难。

Well, here's a solution:

Prelimary steps:

  1. Sort array 1, cutting off the *'s.

Searching:

  1. For each number in array 2 do
    1. Find the first and last entry in array 1 of which the first character matches that of number (binary search).
    2. Do the same for the second character, this time searching not the whole array but between first and last (binary search).
    3. Repeat 2 for the nth character until a string is found.

This should be O(k*n*log(n)) where n is the average number length (in digits) and k the number of numbers.


Basically this is a 1 dimensional Radix tree, for optimal performance you should implement it, but it can be quite hard.

似狗非友 2024-12-12 19:15:26

我的两分钱......

$s = array('1234*', '543*', '321*');
$f = array('123456789', '123456788', '987654321');

foreach ($f as $haystack) {
    echo $haystack."<br>";
    foreach ($s as $needle) {
        $needle = str_replace("*","",$needle);
        echo $haystack "- ".$needle.": ".startsWith($haystack, $needle)."<br>";
    }
}

function startsWith($haystack, $needle) {
    $length = strlen($needle);
    return (substr($haystack, 0, $length) === $needle);
}

为了提高性能,首先对两个数组进行排序并在内部 foreach 循环中添加 exit 子句可能是个好主意。

顺便说一句, startWith 函数来自 SO 中的这个出色的解决方案: PHP 中的startsWith() 和endsWith() 函数

My two cents....

$s = array('1234*', '543*', '321*');
$f = array('123456789', '123456788', '987654321');

foreach ($f as $haystack) {
    echo $haystack."<br>";
    foreach ($s as $needle) {
        $needle = str_replace("*","",$needle);
        echo $haystack "- ".$needle.": ".startsWith($haystack, $needle)."<br>";
    }
}

function startsWith($haystack, $needle) {
    $length = strlen($needle);
    return (substr($haystack, 0, $length) === $needle);
}

To improve performance it might be a good idea to sort both arrays first and to add an exit clause in the inner foreach loop.

By the way, the startWith-function is from this great solution in SO: startsWith() and endsWith() functions in PHP

一张白纸 2024-12-12 19:15:26

另一种选择是在循环中使用 preg_grep :

$masks = array('1224*', '543*', '321*' ...);
$data = array('123456789', '123456788', '987654321' ....);

$matches = array();
foreach($masks as $mask) {
    $mask = substr($mask, 0, strlen($masks) - 2); // strip off trailing *
    $matches[$mask] = preg_grep("/^$mask/", $data);
}

不知道这有多有效,只是将其作为替代方案提供。

Another option would to be use preg_grep in a loop:

$masks = array('1224*', '543*', '321*' ...);
$data = array('123456789', '123456788', '987654321' ....);

$matches = array();
foreach($masks as $mask) {
    $mask = substr($mask, 0, strlen($masks) - 2); // strip off trailing *
    $matches[$mask] = preg_grep("/^$mask/", $data);
}

No idea how efficient this would be, just offering it up as an alternative.

猫瑾少女 2024-12-12 19:15:26

尽管正则表达式并不以速度快而闻名,但我想知道如果将模式简化为其最简洁的形式并且仅调用一次,则 preg_grep() 的执行效果如何(不在循环中)。

通过去除被较短掩模覆盖的较长掩模,图案将大大缩小。会减少多少?当然,我不能肯定地说,但是有17,000个口罩,肯定会有相当多的冗余。

代码:(演示

$masks = ['1224*', '543*', '321*', '12245*', '5*', '122488*'];
sort($masks);
$needle = rtrim(array_shift($masks), '*');
$keep[] = $needle;
foreach ($masks as $mask) {
    if (strpos($mask, $needle) !== 0) {
        $needle = rtrim($mask, '*');
        $keep[] = $needle;
    }
}
// now $keep only contains: ['1224', '321', '5']

$numbers = ['122456789', '123456788', '321876543234567', '55555555555555555', '987654321'];

var_export(
    preg_grep('~^(?:' . implode('|', $keep) . ')~', $numbers)
);

输出:

array (
  0 => '122456789',
  2 => '321876543234567',
  3 => '55555555555555555',
)

Although regex is not famous for being fast, I'd like to know how well preg_grep() can perform if the pattern is boiled down to its leanest form and only called once (not in a loop).

By removing longer masks which are covered by shorter masks, the pattern will be greatly reduced. How much will the reduction be? of course, I cannot say for sure, but with 17,000 masks, there are sure to be a fair amount of redundancy.

Code: (Demo)

$masks = ['1224*', '543*', '321*', '12245*', '5*', '122488*'];
sort($masks);
$needle = rtrim(array_shift($masks), '*');
$keep[] = $needle;
foreach ($masks as $mask) {
    if (strpos($mask, $needle) !== 0) {
        $needle = rtrim($mask, '*');
        $keep[] = $needle;
    }
}
// now $keep only contains: ['1224', '321', '5']

$numbers = ['122456789', '123456788', '321876543234567', '55555555555555555', '987654321'];

var_export(
    preg_grep('~^(?:' . implode('|', $keep) . ')~', $numbers)
);

Output:

array (
  0 => '122456789',
  2 => '321876543234567',
  3 => '55555555555555555',
)
滴情不沾 2024-12-12 19:15:26

查看 PHP 函数 array_intersect_key。

Check out the PHP function array_intersect_key.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文