Regex to match whitespace but skip certain sections

Posted on 2024-11-06 20:58:55

I understand that, since regex is essentially stateless, it's rather difficult to achieve complicated matches without supplementing it with application logic. However, I'm curious to know whether the following is possible.

Match all whitespace, easy enough: \s+

But skip whitespace between certain delimiters, in my case ~~<pre> and </pre>~~ the word nostrip.

Are there any tricks to achieve this? I was thinking along the lines of two separate matches, one for all whitespace, and one for ~~<pre> blocks~~ nostrip sections, and somehow negating the latter from the former.

"This is some text NOSTRIP this is more text NOSTRIP some more text."
// becomes
"ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext."
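
For illustration, the before/after example above can be reproduced in a single pass with PCRE's backtracking control verbs, which PHP's preg functions support. This is only a sketch: the function name stripOutsideNostrip is hypothetical, and it assumes NOSTRIP markers always come in pairs.

```php
<?php
// Sketch only: assumes NOSTRIP markers always come in pairs.
// stripOutsideNostrip() is a hypothetical name, not part of the original post.
function stripOutsideNostrip(string $text): string
{
    // Alternative 1 consumes a whole NOSTRIP...NOSTRIP section and then
    // fails on purpose; (*SKIP) makes the engine resume after that section,
    // so nothing inside it is ever replaced.
    // Alternative 2 matches the whitespace that is left over.
    $result = preg_replace('/NOSTRIP.*?NOSTRIP(*SKIP)(*FAIL)|\s+/s', '', $text);

    // preg_replace() returns null on error; fall back to the input then.
    return $result ?? $text;
}

echo stripOutsideNostrip('This is some text NOSTRIP this is more text NOSTRIP some more text.');
// prints: ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext.
```

An unpaired trailing NOSTRIP is treated as ordinary text here, so whitespace after it is stripped as well.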

The nesting of given ~~tags~~ nostrip sections is irrelevant, and I'm not trying to parse ~~the tree~~ HTML or anything, just tidying a text file, but saving the whitespace in ~~<pre> blocks~~ nostrip sections for obvious reasons.

(better?)

This is ultimately what I went with. I'm sure it can be optimized in a few places, but it works nicely for now.

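public function stripWhitespace($html, array $skipTags = array('pre')){
    // Build one lazy pattern per tag whose contents must be preserved.
    foreach($skipTags as &$tag){
        $name = preg_quote($tag, '#');
        $tag = "<{$name}\\b.*?</{$name}>";
    }
    unset($tag); // break the reference left over from the foreach
    $skipped = array();
    // Swap each protected block for a numbered placeholder delimited by \x1D.
    $buffer = preg_replace_callback('#(?<tag>' . implode('|', $skipTags) . ')#si',
        function($match) use(&$skipped){
            $skipped[] = $match['tag'];
            return "\x1D" . (count($skipped) - 1) . "\x1D";
        }, $html
    );
    // Collapse whitespace runs, then drop whitespace next to tags.
    $buffer = preg_replace('#\s+#si', ' ', $buffer);
    $buffer = preg_replace('#(?:(?<=>)\s|\s(?=<))#si', '', $buffer);
    // Put the protected blocks back in place of their placeholders.
    for($i = count($skipped) - 1; $i >= 0; $i--){
        $buffer = str_replace("\x1D{$i}\x1D", $skipped[$i], $buffer);
    }
    return $buffer;
}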


忘年祭陌 2024-11-13 20:58:55

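If you are using a scripting language, I would use a multi-step approach:

Pull out the NOSTRIP sections, save them to an array, and replace them with markers (### or something similar)
Replace all whitespace
Re-inject all of the saved NOSTRIP pieces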
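
The three steps above might look roughly like this in PHP. This is a minimal sketch, not a definitive implementation: the function name stripWithPlaceholders and the "###n###" marker format are assumptions, and the marker must not already occur in the input.

```php
<?php
// Sketch only: the "###n###" marker format is an assumption and must not
// occur in the input text. stripWithPlaceholders() is a hypothetical name.
function stripWithPlaceholders(string $text): string
{
    $saved = [];

    // Step 1: pull out each NOSTRIP section and leave a numbered marker behind.
    $text = preg_replace_callback(
        '/NOSTRIP.*?NOSTRIP/s',
        function (array $m) use (&$saved): string {
            $saved[] = $m[0];
            return '###' . (count($saved) - 1) . '###';
        },
        $text
    );

    // Step 2: remove all remaining whitespace.
    $text = preg_replace('/\s+/', '', $text);

    // Step 3: re-inject the saved sections in place of their markers.
    return preg_replace_callback(
        '/###(\d+)###/',
        function (array $m) use ($saved): string {
            return $saved[(int) $m[1]];
        },
        $text
    );
}

echo stripWithPlaceholders('This is some text NOSTRIP this is more text NOSTRIP some more text.');
// prints: ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext.
```

Numbering the markers keeps the re-injection unambiguous even when several sections are saved.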


ζ澈沫 2024-11-13 20:58:55

I once created a set of functions to reduce whitespace in HTML output:

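function minify($html) {
        if(empty($html)) {
                return $html;
        }
        // The /e modifier was removed in PHP 7, so match explicitly instead.
        // $m[1]: text before the first <pre> block, $m[3]: that block, $m[4]: the rest.
        if(!preg_match('/^(.*?)((<pre.*?<\/pre>)(.*))?$/sD', $html, $m)) {
                return $html;
        }
        // Minify the leading text, keep the <pre> block untouched, recurse on the rest.
        return parse($m[1]) . ($m[3] ?? '') . minify($m[4] ?? '');
}

function parse($html) {
        // Replace multiple spaces with a single space
        $html = preg_replace('/(\s+)/m', ' ', $html);
        // Remove spaces that are followed by either > or <
        $html = preg_replace('/ ([<>])/', '$1', $html);
        // Remove spaces that directly follow >
        $html = str_replace('> ', '>', $html);
        return $html;
}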

$html = minify($html);

You'll probably have to modify this slightly to fit your needs.
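
One possible adaptation, offered only as a sketch, avoids recursion by splitting the input on the protected blocks with preg_split and PREG_SPLIT_DELIM_CAPTURE. The function name minifyBySplitting is hypothetical, and <pre> blocks are assumed not to nest.

```php
<?php
// Sketch only: minifyBySplitting() is a hypothetical name, and <pre> blocks
// are assumed not to nest.
function minifyBySplitting(string $html): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured <pre> blocks land at odd
    // indexes and the text between them at even indexes.
    $parts = preg_split('/(<pre\b.*?<\/pre>)/si', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Same clean-up as parse(): collapse whitespace runs, then drop
            // spaces before < or > and spaces directly after >.
            $part = preg_replace('/\s+/', ' ', $part);
            $part = preg_replace('/ ([<>])/', '$1', $part);
            $parts[$i] = str_replace('> ', '>', $part);
        }
    }

    return implode('', $parts);
}

$html = minifyBySplitting("<p> a   b </p>\n<pre>  x\n  y </pre>  <p>c</p>");
```

Swapping the split pattern for another delimiter, such as a NOSTRIP pair, should be the only change needed to protect different sections.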
