Regex match whitespace but skip sections
I understand that since regex is essentially stateless, complicated matches are difficult to achieve without supplementing them with application logic. Still, I'm curious whether the following is possible.
Match all whitespace, easy enough: \s+
But skip whitespace between certain delimiters, in my case <pre> and </pre> (originally the word nostrip).
Are there any tricks to achieve this? I was thinking along the lines of two separate matches, one for all whitespace and one for <pre> blocks, and somehow negating the latter from the former.
"This is some text NOSTRIP this is more text NOSTRIP some more text."
// becomes
"ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext."
Nesting of the given tags is irrelevant, and I'm not trying to parse the HTML tree or anything. I'm just tidying a text file while preserving the whitespace in <pre> blocks, for obvious reasons.
(better?)
This is ultimately what I went with. I'm sure it can be optimized in a few places, but it works nicely for now.
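public function stripWhitespace($html, array $skipTags = array('pre')){
    // Build one pattern per protected tag, e.g. <pre ...>...</pre>.
    // \b keeps "pre" from also matching tags such as <preview>.
    $patterns = array();
    foreach($skipTags as $tag){
        $name = preg_quote($tag, '#');
        $patterns[] = "<{$name}\\b.*?</{$name}>";
    }
    // Swap every protected block for a numbered placeholder wrapped in \x1D.
    $skipped = array();
    $buffer = preg_replace_callback('#(?<tag>' . implode('|', $patterns) . ')#si',
        function($match) use(&$skipped){
            $skipped[] = $match['tag'];
            return "\x1D" . (count($skipped) - 1) . "\x1D";
        }, $html
    );
    // Collapse whitespace runs to one space, then drop whitespace next to tags.
    $buffer = preg_replace('#\s+#', ' ', $buffer);
    $buffer = preg_replace('#(?:(?<=>)\s|\s(?=<))#', '', $buffer);
    // Put the protected blocks back, highest index first.
    for($i = count($skipped) - 1; $i >= 0; $i--){
        $buffer = str_replace("\x1D{$i}\x1D", $skipped[$i], $buffer);
    }
    return $buffer;
}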
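For comparison with the placeholder approach above, here is a minimal sketch of a single-pattern variant. It is not part of the original post. It relies on PCRE's backtracking control verbs (*SKIP) and (*FAIL), which PHP's preg functions support: the first alternative consumes a protected section and then deliberately fails, so the engine resumes searching after that section, and only whitespace outside it is matched by \s+. The function name stripOutside and its parameters are hypothetical, and the delimiters are matched literally (so an opening tag with attributes would not be recognised).

```php
<?php
// Hypothetical helper for illustration; not from the original post.
// Removes all whitespace except inside sections enclosed by $open ... $close.
function stripOutside(string $text, string $open, string $close): string
{
    // Alternative 1: a whole protected section, then (*SKIP)(*FAIL) rejects
    // the match and moves the search position past the section.
    // Alternative 2: any run of whitespace outside such a section.
    $pattern = '#' . preg_quote($open, '#') . '.*?' . preg_quote($close, '#')
             . '(*SKIP)(*FAIL)|\s+#s';

    // preg_replace returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, '', $text) ?? $text;
}

$in = 'This is some text NOSTRIP this is more text NOSTRIP some more text.';
echo stripOutside($in, 'NOSTRIP', 'NOSTRIP'), "\n";
// prints: ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext.
```

Unlike stripWhitespace, this sketch removes whitespace outright rather than collapsing it to single spaces, which matches the example in the question.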
Comments (2)
If you are using a scripting language, I would use a multi-step approach.
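The individual steps of this suggestion are not preserved here, so the following is only one possible reading of a multi-step approach, sketched in PHP: split the text on the protected blocks, strip whitespace from the pieces in between, and join everything back together. The function name stripOutsidePre is hypothetical, and the sketch assumes <pre> blocks are not nested.

```php
<?php
// Hypothetical sketch of a multi-step approach; names are illustrative only.
function stripOutsidePre(string $html): string
{
    // Step 1: split on <pre>...</pre> blocks. The capture group together with
    // PREG_SPLIT_DELIM_CAPTURE keeps the blocks themselves in the result.
    $parts = preg_split('#(<pre\b.*?</pre>)#si', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // PCRE error: leave the input untouched
    }

    // Step 2: with a single capture group, even indexes hold the text between
    // blocks and odd indexes hold the <pre> blocks. Strip only the former.
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace('#\s+#', '', $part);
        }
    }

    // Step 3: reassemble the pieces in their original order.
    return implode('', $parts);
}

echo stripOutsidePre("a b <pre> c  d </pre> e f"), "\n";
// prints: ab<pre> c  d </pre>ef
```

Keeping the steps separate makes it easy to change what happens to the unprotected text, for example collapsing whitespace to a single space instead of removing it.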
I once created a set of functions to reduce whitespace in HTML output:
You'll probably have to modify this slightly to fit your needs.