针对给定单词周围的 N 个单词优化正则表达式 (UTF-8)

发布于 2024-09-16 17:54:16 字数 1883 浏览 6 评论 0原文

我正在尝试找到一个优化的正则表达式来返回另一个单词周围的 N 个单词（如果可用）以构建摘要。该字符串采用 UTF-8 格式，因此“单词”的定义比 [az] 更大。用作参考词的字符串可以位于单词的中间，也可以不直接被空格包围。

我已经有了以下有效的方法，但在围绕另一个单词查找超过 6-7 个单词时，看起来实际上很贪婪并且令人窒息：

/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u

这是我为此构建的 PHP 方法，但我需要帮助使正则表达式能够不要那么贪婪，为周围的任意数量的单词而努力。

/**
 * Finds N words around a specified word in a string.
 *
 * @param string $string The complete string to look in.
 * @param string $find The string to look for.
 * @param integer $before The number of words to look for before $find.
 * @param integer $after The number of words to look for after $find.
 * @return mixed False if $find was not found and all the words around otherwise.
 */
private function getWordsAround($string, $find, $before, $after)
{
    $matches = array();
    $find = preg_quote($find);
    $regex = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' .
        $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}';
    if (preg_match("/$regex/u", $string, $matches)) {
        return $matches[0];
    } else {
        return false;
    }
}

如果我有以下 $string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum 
eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu 
fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus 
luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. 
Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."

并调用 getWordsAround($string, 'vitae', 8, 8) 我想得到以下结果：

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit,"

感谢正则表达式专家的帮助。

原文

I'm trying to find an optimized regex to return the N words (if available) around another one to build a summary. The string is in UTF-8, so the definition of "words" is larger than just [a-z]. The string that serves as the reference word could be in the middle of a word or not directly surrounded by spaces.

I've already got the following that works but seems actually greedy and chokes when looking for more than 6-7 words around another one:

/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u

This is the PHP method I've build to do that but I'd need help getting the regex to be less greedy and work for any number of words around.

/**
 * Finds N words around a specified word in a string.
 *
 * @param string $string The complete string to look in.
 * @param string $find The string to look for.
 * @param integer $before The number of words to look for before $find.
 * @param integer $after The number of words to look for after $find.
 * @return mixed False if $find was not found and all the words around otherwise.
 */
private function getWordsAround($string, $find, $before, $after)
{
    $matches = array();
    $find = preg_quote($find);
    $regex = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' .
        $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}';
    if (preg_match("/$regex/u", $string, $matches)) {
        return $matches[0];
    } else {
        return false;
    }
}

If I had the following $string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum 
eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu 
fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus 
luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. 
Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."

And called getWordsAround($string, 'vitae', 8, 8) I'd want to get the following result:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit,"

Thank you for your help regex gurus.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

耳钉梦 2024-09-23 17:54:16

使用正则表达式或其他方法将输入文本拆分为单词数组怎么样？然后循环遍历单词以查找目标单词。找到后，获取所需的数组切片，将其连接在一起并打印。

要保持单词之间的原始空白，您可以将其包含在每个单词的末尾。

此外，这可以作为流解析器来实现，而不是首先分割整个字符串。

回复收藏 0 原文

塔塔猫 2024-09-23 17:54:16

正如前面提到的，问题在于大量的回溯。为了解决这个问题，我尝试使用lookbehind和lookahead将匹配锚定到字符串。所以我想出了：

/consectetur(?<=((?:\S+\s+){0,8})\s*consectetur)\s*(?=((?:\S+\s+){0,8}))/

不幸的是，这不起作用，因为PCRE（或perl）不支持可变长度lookbehinds。所以我们剩下：

/consectetur\s*(?:\S+\s+){0,8}/

只捕获匹配字符串和匹配后最多 8 个单词。但是，如果您使用PREG_OFFSET_CAPTURE标志，获取$match[0]的偏移量，取出到该点的子字符串，使用strrev反转字符串，获取前0-8个单词（使用< code>/\s*(?:\S+\s+){0,8}/)，反转匹配并重新组合：

$s = "put test string here";
$matches = array();
if (preg_match('/consectetur\s*(?:\S+\s+){0,8}/', $s, $matches, PREG_OFFSET_CAPTURE)) {
  $before = strrev(substr($s, 0, $matches[0][1]));
  $before_match = array();
  preg_match('/\s*(?:\S+\s+){0,8}/', $before, $before_match);
  echo strrev($before_match[0]) . $matches[0][0];
}

您可以通过采用以下安全子集来使其在非常大的字符串上运行得更快一些匹配之前的字符，例如 100。那么您仅反转 100 个字符的字符串。

话虽如此，不使用正则表达式的解决方案可能效果更好。

As mentioned earlier, the problem is a very large amount of backtracking. To solve this, I tried to use lookbehind and lookahead to anchor the match to the string. So I came up with:

/consectetur(?<=((?:\S+\s+){0,8})\s*consectetur)\s*(?=((?:\S+\s+){0,8}))/

Unfortunately, this does not work, as variable length lookbehinds are not supported in PCRE (or perl for that matter). So we are left with:

/consectetur\s*(?:\S+\s+){0,8}/

Which only captures the match string and up to 8 words after the match. However, if you use the PREG_OFFSET_CAPTURE flag, get the offset of $match[0], take the substring up to that point, reverse the string with strrev, get the first 0-8 words (using /\s*(?:\S+\s+){0,8}/), reverse the match, and recombine:

$s = "put test string here";
$matches = array();
if (preg_match('/consectetur\s*(?:\S+\s+){0,8}/', $s, $matches, PREG_OFFSET_CAPTURE)) {
  $before = strrev(substr($s, 0, $matches[0][1]));
  $before_match = array();
  preg_match('/\s*(?:\S+\s+){0,8}/', $before, $before_match);
  echo strrev($before_match[0]) . $matches[0][0];
}

You can make it a bit faster on very big strings by taking a safe subset of characters prior to the match, like 100. Then you are only reversing a 100 character string.

All that being said, a solution which doesn't use regular expressions may work better.

回复收藏 0 原文

信愁 2024-09-23 17:54:16

这是一个内部 PHP 函数，可以完成您想要的操作。您不太可能在用户空间函数中击败这种性能。

对于 UTF-8 函数使用此功能应该没有问题，因为 '\r'、'\n' 和 ' '（通常是所有 ASCII 字符）不能作为另一个字符序列的一部分出现。因此，如果将有效的 UTF-8 数据传递给两个参数，应该没问题。像通常反转单个字符编码（使用 strrev）一样反转 UTF-8 数据确实会带来麻烦，但此函数不会这样做。

PHP_FUNCTION(surrounding_text)
{
    struct circ_array {
        int *offsets;
        int cur;
        int size;
    } circ_array;
    long before;
    long after;
    char *haystack, *needle;
    int haystack_len, needle_len;
    int i, in_word = 0, in_match = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ssll",
        &haystack, &haystack_len, &needle, &needle_len, &before, &after) 
        == FAILURE)
        return;

    if (needle_len == 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Cannot have empty needle");
        return;
    }

    if (before < 0 || after < 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Number of words after and before should be non-negative");
        return;
    }

    /* saves beggining of match and words before */
    circ_array.offsets = safe_emalloc(before + 1, sizeof *circ_array.offsets, 0);
    circ_array.cur = 0;
    circ_array.size = before + 1;

    for (i = 0; i < haystack_len; i++) {
        if (haystack[i] == needle[in_match]) {
            in_match++;
            if (!in_word) {
                in_word = 1;
                circ_array.offsets[circ_array.cur % circ_array.size] = i;
                circ_array.cur++;
            }
            if (in_match == needle_len)
                break; /* found */
        } else {
            int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';

            if (in_match)
                in_match = 0;

            if (is_sep) {
                if (in_word)
                    in_word = 0;
            } else { /* not a separator */
                if (!in_word) {
                    in_word = 1;
                    circ_array.offsets[circ_array.cur % circ_array.size] = i;
                    circ_array.cur++;
                }
            }
        }
    }

    if (in_match != needle_len) {
        efree(circ_array.offsets);
        RETURN_FALSE;
    }


    /* find words after; in_word is 1 */
    for (i++; i < haystack_len; i++) {
        int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';
        if (is_sep) {
            if (in_word) {
                if (after == 0)
                    break;
                after--;
                in_word = 0;
            }
        } else {
            if (!in_word)
                in_word = 1;
        }
    }

    {
        char *result;
        int start, result_len;
        if (circ_array.cur < circ_array.size)
            start = circ_array.offsets[0];
        else
            start = circ_array.offsets[circ_array.cur % circ_array.size];

        result_len = i - start;
        result = emalloc(result_len + 1);
        memcpy(result, &haystack[start], result_len);
        result[result_len] = '\0';

        efree(circ_array.offsets);
        RETURN_STRINGL(result, result_len, 0);
    }

}

根据我的测试，C 函数比 wuputah 的版本快 4 倍（并且不存在 strrev 的问题）。

Here is an internal PHP function that does what you want. It's unlikely you'll be able to beat this performance-wise in a user-land function.

There should be no problem using this for UTF-8 functions, since '\r', '\n' and ' ' (and in general all the ASCII characters) cannot appear as part of another character sequence. So if you pass valid UTF-8 data to both parameters you should be fine. Reversing UTF-8 data as you would normally reverse single character encodings (with strrev) would indeed mean trouble, but this function doesn't do that.

PHP_FUNCTION(surrounding_text)
{
    struct circ_array {
        int *offsets;
        int cur;
        int size;
    } circ_array;
    long before;
    long after;
    char *haystack, *needle;
    int haystack_len, needle_len;
    int i, in_word = 0, in_match = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ssll",
        &haystack, &haystack_len, &needle, &needle_len, &before, &after) 
        == FAILURE)
        return;

    if (needle_len == 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Cannot have empty needle");
        return;
    }

    if (before < 0 || after < 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Number of words after and before should be non-negative");
        return;
    }

    /* saves beggining of match and words before */
    circ_array.offsets = safe_emalloc(before + 1, sizeof *circ_array.offsets, 0);
    circ_array.cur = 0;
    circ_array.size = before + 1;

    for (i = 0; i < haystack_len; i++) {
        if (haystack[i] == needle[in_match]) {
            in_match++;
            if (!in_word) {
                in_word = 1;
                circ_array.offsets[circ_array.cur % circ_array.size] = i;
                circ_array.cur++;
            }
            if (in_match == needle_len)
                break; /* found */
        } else {
            int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';

            if (in_match)
                in_match = 0;

            if (is_sep) {
                if (in_word)
                    in_word = 0;
            } else { /* not a separator */
                if (!in_word) {
                    in_word = 1;
                    circ_array.offsets[circ_array.cur % circ_array.size] = i;
                    circ_array.cur++;
                }
            }
        }
    }

    if (in_match != needle_len) {
        efree(circ_array.offsets);
        RETURN_FALSE;
    }


    /* find words after; in_word is 1 */
    for (i++; i < haystack_len; i++) {
        int is_sep = haystack[i] == ' ' || haystack[i] == '\n' || haystack[i] == '\r';
        if (is_sep) {
            if (in_word) {
                if (after == 0)
                    break;
                after--;
                in_word = 0;
            }
        } else {
            if (!in_word)
                in_word = 1;
        }
    }

    {
        char *result;
        int start, result_len;
        if (circ_array.cur < circ_array.size)
            start = circ_array.offsets[0];
        else
            start = circ_array.offsets[circ_array.cur % circ_array.size];

        result_len = i - start;
        result = emalloc(result_len + 1);
        memcpy(result, &haystack[start], result_len);
        result[result_len] = '\0';

        efree(circ_array.offsets);
        RETURN_STRINGL(result, result_len, 0);
    }

}

From my tests, the C function is 4 times faster than wuputah's version (and doesn't have the problem of strrev).

回复收藏 0 原文

谢绝鈎搭 2024-09-23 17:54:16

这在这里工作得很好：

(?:[^\s\r\n]*[\s\r\n]+){0,8}(?:[^\s\r\n]*)consectetur(?:[^\s\r\n]*)(?:[\s\r\n]+[^\s\r\n]*){0,8}

给出：

Lorem ipsum dolor sat amet，consectetur adipiscing elit。克拉斯演员，
非车辆性猫科动物 Suscipit，

然而，这个正则表达式的性能绝对是垃圾。我真的不知道如何提高效率，除非不使用正则表达式。

对于接近结尾的单词来说，性能“绝对是垃圾”的原因是引擎尝试在每个字符上开始匹配，然后前进几十个字符，直到最后发现，它找不到您正在寻找的字符串并丢弃所有内容。

This worked fine here:

(?:[^\s\r\n]*[\s\r\n]+){0,8}(?:[^\s\r\n]*)consectetur(?:[^\s\r\n]*)(?:[\s\r\n]+[^\s\r\n]*){0,8}

Gives:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor,
felis non vehicula suscipit,

The performance of this regular expression, however, is absolute crap. I really don't know how to make this more efficient, short of doing it without regular expressions.

The reason for the performance being "absolute crap" for words near the end is that engine tries to start a match on every character and then advances several dozens of characters until it finds out that, in the end, it cannot find the string you're looking for and discards everything.

回复收藏 0 原文

梦醒灬来后我 2024-09-23 17:54:16

使用此正则表达式的问题是它会导致正则表达式引擎灾难性地回溯。尝试次数随着字符串的大小呈指数增长，这是不好的。您可能需要研究原子分组来提高性能。

或者，您可以找到给定单词的第一次出现，然后开始向后和向前查找单词，直到达到所需的长度。伪代码：

$pos = strpos($find);
$result = $find;

foreach $word before $pos {
    $result = $word . $result;
    $count++
    if ($count >= $target)
        break;
}

foreach $word after $pos {
    $result .= $word;
    $count++
    if ($count >= $target)
        break;
}

当然，查找前后的单词以及处理部分字符串可能会变得非常混乱。

The problem with using this regex is that it causes the regex engine to catastrophically backtrack. The number of attempts increases exponentially with the size of the string, and that is no good. You might want to look into atomic grouping to improve performance.

Alternatively you could find the first occurrence of the given word and start looking backwards and forwards for words up to the desired length. Pseudo-ish code:

$pos = strpos($find);
$result = $find;

foreach $word before $pos {
    $result = $word . $result;
    $count++
    if ($count >= $target)
        break;
}

foreach $word after $pos {
    $result .= $word;
    $count++
    if ($count >= $target)
        break;
}

Of course finding the words before and after, and handling partial strings can get really messy.

回复收藏 0 原文

~没有更多了~