Regex to replace line breaks with spaces, except inside HTML attribute values

Posted on 2024-09-28 13:07:37

I'm trying to write a regular expression that replaces line feeds between certain areas of a text file, but only in plain text content (i.e. excluding text inside HTML attribute values, like href). I'm not having much luck past the first part.

Example input:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-
link-that-breaks">This is an example.</a> This is an example. This is yet another
example.
END CONTENT
COMMENTS: 0

Example output:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-link-that-breaks">This is an example.</a> This is an example. This is yet another example.
END CONTENT
COMMENTS: 0

So ideally, line breaks are replaced with a space when they occur in plain text, but are removed without adding a space when they are inside HTML attribute values (mostly href, and I'm fine if I have to limit it to that).

Comments (2)

花想c 2024-10-05 13:07:37

This will remove newlines in attribute values, assuming the values are enclosed in double-quotes:

$s = preg_replace(
       '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
       '', $s);

The lookahead asserts that, between the current position (where the newline was found) and the next >, there's an odd number of double-quotes. This doesn't allow for single-quoted values, or for angle brackets inside the values; both can be accommodated if need be, but this is ugly enough already. ;)
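
To see which line breaks the lookahead selects, here is a minimal PHP sketch. The sample string and the variable names are made up for illustration; only the pattern comes from the answer above.

```php
<?php
// Made-up sample: one line break inside the href value, one in plain text.
$sample = "See <a href=\"http://example.com/a-\nb\">this\nlink</a> now.";

// Pattern from the answer: a run of line breaks followed by an odd number
// of double quotes before the next '>'.
$pattern = '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/';

$step1 = preg_replace($pattern, '', $sample);

// Only the break inside href="..." is gone; the one between "this" and
// "link" is still there for the second replacement to handle.
echo $step1, "\n";
```

The break after "this" is left alone because no double quote appears between it and the next angle bracket, so the lookahead fails there.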

After that, you can replace any remaining newlines with spaces:

$s = preg_replace('/[\r\n]+/', ' ', $s);

See it in action on ideone.com.
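
Neither replacement limits itself to the CONTENT block described in the question. One way to do that, as a rough sketch, is to run both steps inside `preg_replace_callback`. The function name `joinContentLines` is hypothetical, and the sketch assumes each block starts with a `CONTENT:` line, ends with an `END CONTENT` line, and is not empty.

```php
<?php
// Hypothetical helper: joins wrapped lines inside each CONTENT block only.
function joinContentLines(string $text): string
{
    // Step-one pattern from the answer above (double-quoted values only).
    $attrNewline = '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/';

    $result = preg_replace_callback(
        // Group 2 is the body between a "CONTENT:" line and "END CONTENT".
        '/^(CONTENT:\R)(.*?)(\R)(?=END CONTENT\b)/ms',
        function (array $m) use ($attrNewline): string {
            $body = preg_replace($attrNewline, '', $m[2]); // inside attribute values
            $body = preg_replace('/[\r\n]+/', ' ', $body);  // remaining line breaks
            return $m[1] . $body . $m[3];
        },
        $text
    );

    return $result ?? $text; // fall back to the input if PCRE reports an error
}

// The example input from the question.
$input = <<<'TXT'
AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-
link-that-breaks">This is an example.</a> This is an example. This is yet another
example.
END CONTENT
COMMENTS: 0
TXT;

$output = joinContentLines($input);
echo $output, "\n";
```

The header lines (AUTHOR, DATE, COMMENTS) are outside the captured body, so their line breaks are left as they are.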

樱花落人离去 2024-10-05 13:07:37

Ideally you would use a real HTML parser (or an XML parser if it were XHTML) and replace the attribute contents with that.

However, the following may do the trick if the engine supports positive lookbehind of arbitrary length:

(?<=\<[^<>]+=\s*("[^"]*|'[^']*))[\r\n]+

Usage: Replace all occurrences of this regex with an empty string.
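
PHP's PCRE engine does not accept an unbounded lookbehind like this one, so the pattern as written is for engines that do, such as .NET. As a rough approximation in PHP, the lookbehind can be turned into an ordinary prefix followed by `\K`, which drops the prefix from the reported match. Since each pass can then remove only one run of line breaks per tag, the sketch repeats until nothing changes. The function name `stripAttrNewlines` is hypothetical.

```php
<?php
// Hypothetical helper: approximates the lookbehind pattern with \K.
function stripAttrNewlines(string $s): string
{
    // Same prefix as the lookbehind, then \K so only the line breaks
    // are replaced. Handles double- and single-quoted values.
    $pattern = '/<[^<>]+=\s*(?:"[^"]*|\'[^\']*)\K[\r\n]+/';

    do {
        // The prefix consumes the tag start, so a tag with several
        // broken values needs more than one pass.
        $s = preg_replace($pattern, '', $s, -1, $count);
    } while ($count > 0);

    return $s;
}

// Made-up sample with breaks in two attribute values and one in plain text.
$html = "<a href=\"http://example.com/a-\nb-\nc\" title='x\ny'>text\nmore</a>";
$clean = stripAttrNewlines($html);
echo $clean, "\n";
```

The break between "text" and "more" is outside any tag, so it is left for a separate replacement with a space.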
