Fastest PHP Routine to Match Words

Posted on 2024-08-28 22:55:21 · 582 words · 4 views · 0 comments

What is the fastest way in PHP to take a keyword list and match it to a search result (like an array of titles) for all words?

For instance, if my keyword phrase is "great leather shoes", then the following titles would be a match...

  • Get Some Really Great Leather Shoes
  • Leather Shoes Are Great
  • Great Day! Those Are Some Cool Leather Shoes!
  • Shoes, Made of Leather, Can Be Great

...while these would not be a match:

  • Leather Shoes on Sale Today!
  • You'll Love These Leather Shoes Greatly
  • Great Shoes Don't Come Cheap

I imagine there's some trick with array functions or a RegEx (Regular Expression) to achieve this rapidly.

Comments (6)

骄兵必败 2024-09-04 22:55:21

I would use an index for the words in the titles and test if every search term is in that index:

$terms = explode(' ', 'great leather shoes');
$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Great Day! Those Are Some Cool Leather Shoes!',
    'Shoes, Made of Leather, Can Be Great'
);
foreach ($titles as $title) {
    // extract words in lowercase and use them as key for the word index
    $wordIndex = array_flip(preg_split('/\P{L}+/u', mb_strtolower($title), -1, PREG_SPLIT_NO_EMPTY));
    // look up if every search term is in the index
    foreach ($terms as $term) {
        if (!isset($wordIndex[$term])) {
            // if one is missing, continue with the outer foreach
            continue 2;
        }
    }
    // echo matched title
    echo "match: $title\n";
}

瀟灑尐姊 2024-09-04 22:55:21

You can preg_grep() your array against something like

 /^(?=.*?\bgreat)(?=.*?\bleather)(?=.*?\bshoes)/i

or (probably faster) grep each word separately and then array_intersect the results.
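
The second suggestion (grep each word separately, then intersect the results) could look roughly like the sketch below. The function name titlesWithAllKeywords is made up for illustration, and the trailing \b and the i flag are assumptions added here so that "Greatly" is rejected and capitalised titles are accepted.

```php
<?php
// Hypothetical helper, not part of the original answer.
function titlesWithAllKeywords(array $keywords, array $titles): array
{
    $remaining = $titles;
    foreach ($keywords as $keyword) {
        // \b on both sides: whole-word match only; i: ignore case.
        $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
        // Titles that contain this one keyword.
        $hits = preg_grep($pattern, $titles);
        // Keep only the titles that have matched every keyword so far.
        $remaining = array_intersect($remaining, $hits);
    }
    return array_values($remaining);
}
```

Whether this beats the single lookahead pattern would have to be measured on real data; grepping $remaining instead of $titles in each round would shrink the input as keywords are processed.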

国粹 2024-09-04 22:55:21

It might be a pretty naive solution (quite possibly there are more efficient/elegant solutions), but I'd probably do something like the following:

$keywords = array(
    'great',
    'leather',
    'shoes'
);

$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Great Day! Those Are Some Cool Leather Shoes!',
    'Shoes, Made of Leather, Can Be Great',
    'Leather Shoes on Sale Today!',
    'You\'ll Love These Leather Shoes Greatly',
    'Great Shoes Don\'t Come Cheap'
);

$matches = array();
foreach( $titles as $title )
{
  // -1 means "no limit"; passing null here is deprecated as of PHP 8.1
  $wordsInTitle = preg_split( '~\b(\W+\b)?~', $title, -1, PREG_SPLIT_NO_EMPTY );
  if( array_uintersect( $keywords, $wordsInTitle, 'strcasecmp' ) == $keywords )
  {
    // we have a match
    $matches[] = $title;
  }
}

var_dump( $matches );

No idea how this benchmarks though.

陈独秀 2024-09-04 22:55:21

You could use

/(?=.*?\bgreat\b)(?=.*?\bshoes\b)(?=.*?\bleather\b)/

Note a couple of things:

a) You need word boundaries at both ends; otherwise you could end up matching words that contain the ones you are looking for, e.g. "shoes of leather bring greatness".

b) I use a lazy wildcard match (i.e. .*?). This improves efficiency, as by default * is greedy (i.e. it consumes as many characters as it can match, and only gives them up in favor of an overall match). So if we don't have the trailing ?, .* will match everything in the line and then backtrack to match 'great'. The same procedure is then repeated for 'shoes' and 'leather'. By making * lazy, we avoid these unnecessary backtracks.
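
As a minimal sketch of point (a), the pattern can be generated from a keyword list with a word boundary on both sides of each word. The helper name buildLookaheadPattern is hypothetical; the leading ^ and the i flag are assumptions added here (the anchor keeps the engine from retrying the lookaheads at every offset, and the flag lets "Great" match "great").

```php
<?php
// Hypothetical helper, not part of the original answer.
function buildLookaheadPattern(array $keywords): string
{
    $lookaheads = '';
    foreach ($keywords as $keyword) {
        // One lookahead per keyword, bounded by \b on both ends.
        $lookaheads .= '(?=.*?\b' . preg_quote($keyword, '/') . '\b)';
    }
    return '/^' . $lookaheads . '/i';
}

$pattern = buildLookaheadPattern(['great', 'shoes', 'leather']);
var_dump(preg_match($pattern, 'Shoes, Made of Leather, Can Be Great')); // int(1)
var_dump(preg_match($pattern, 'shoes of leather bring greatness'));     // int(0)
```

The second call fails because "greatness" has no word boundary right after "great".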

伤感在游骋 2024-09-04 22:55:21

I don't know about the absolute fastest way, but this is probably the fastest way to do it with a regex:

'#(?:\b(?>great\b()|leather\b()|shoes\b()|\w++\b)\W*+)++\1\2\3#i'

This matches every word in the string, and if the word happens to be one of your keywords, the empty capturing group "checks it off". Once all the words in the string have been matched, the back-references (\1\2\3) ensure that each of the three keywords has been seen at least once.

The lookahead-based approach that's usually recommended for this kind of task needs to scan potentially the whole string multiple times--once for each keyword. This regex only has to scan the string once--in fact, backtracking is disabled by the possessive quantifiers (++, *+) and atomic groups ((?>...)).

That said, I would still go with the lookahead approach unless I knew it was causing a bottleneck. In most cases, its greater readability is worth the trade-off in performance.
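
For an arbitrary keyword list, the single-pass pattern above could be assembled as in the following sketch. The helper name buildSinglePassPattern is hypothetical; for the three example keywords it produces exactly the pattern shown in this answer. It relies on PCRE's default behaviour that a back-reference to a group which never participated in the match fails.

```php
<?php
// Hypothetical helper, not part of the original answer.
function buildSinglePassPattern(array $keywords): string
{
    $alternatives = [];
    $backrefs = '';
    foreach (array_values($keywords) as $i => $keyword) {
        // The empty group () is set only when this keyword matched as a whole word.
        $alternatives[] = preg_quote($keyword, '#') . '\b()';
        // Back-reference to that group; it fails if the group was never set.
        $backrefs .= '\\' . ($i + 1);
    }
    // Fallback alternative: consume any other word.
    $alternatives[] = '\w++\b';
    return '#(?:\b(?>' . implode('|', $alternatives) . ')\W*+)++' . $backrefs . '#i';
}

$pattern = buildSinglePassPattern(['great', 'leather', 'shoes']);
$isMatch = preg_match($pattern, 'Leather Shoes Are Great') === 1;
```

Keywords are assumed to be plain words here; anything containing non-word characters would not fit the word-by-word scan.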

橪书 2024-09-04 22:55:21

I can't offer you a definitive answer, but I'd try benchmarking each solution that's suggested and would start with chaining some in_array() calls together.

if (in_array('great', $list) && in_array('leather', $list) && in_array('shoes', $list)) {
    // Do something
}
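
The snippet above does not say where $list comes from. A minimal sketch, assuming $list is meant to be the lowercase words of one title and that the keywords are already lowercase (the function name titleHasAllWords is made up for illustration):

```php
<?php
// Hypothetical helper, not part of the original answer.
function titleHasAllWords(string $title, array $keywords): bool
{
    // Assumption: $list holds the title's words, lowercased, punctuation dropped.
    $list = preg_split('/\W+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($keywords as $keyword) {
        // Strict comparison: both sides are strings.
        if (!in_array($keyword, $list, true)) {
            return false; // one keyword missing is enough to reject the title
        }
    }
    return true;
}
```

The loop is equivalent to the chained in_array() calls but works for any number of keywords; strtolower() only handles ASCII, so non-English titles would need mb_strtolower() and a Unicode-aware split.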