PHP preg_replace data between tags, respecting other tags in the document
There is probably a very simple answer to this, but I want to be as detailed as possible so that you do not need me to clarify.
I am trying to collect the contents of every
<content><div>CONTENT</div></content>
The content needs to be returned as a backreference ($1). Both the content and the div have differing attributes (such as style="color: white;"). These attributes are unimportant, but exist nonetheless.
The complication is that the div may contain child divs. These are not important, but they conflict with my current regex, stopping the match early.
Here is a sample of the code, imagine this copy/pasted several times and formatted differently.
<entry>
<title>A general title of a post</title>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
This is a description of the title. It may <b>contain bold text</b> or <div>even divs</div>, and everything else. It is not quite important to save these tags, but they exist nonetheless.
</div>
</content>
</entry>
Currently, I am using two regex codes. One for the declaration, and one for the closing tags. This works, but now I need to execute code on the contents. So, I will use preg_replace_callback(), but I can't figure out how to connect the two so that the middle is a callback.
Declaration:
<content \w+\s*=\s*\".*?\">[\r\n\s]{0,}<div \w+\s*=\s*\".*?\">
Closing:
</div>[\r\n\s]{0,}</content>
I need these combined, with the contents passed to the callback. I have tried something like ([\w\W]{0,}), which returns absolutely everything, but this match doesn't stop at the closing div.
So I found out about the \bFULLWORD\b command, and threw \bdiv\b on that... But I have had no success getting that to work. Perhaps it is not supported by PHP? Or I am stupid.
I do not know.
Please help!
Comments (2)
It's been said before and it's being said now, and unfortunately it's going to be said again. Regex is a wonderful tool. It's great for manipulating strings and matching patterns with regular expressions.
HTML is not a string. HTML is a markup language, not a regular language. It's not truly a string, but can be interpreted as one (which is why we can technically use regex to manipulate HTML). HTML is its own language based on element nodes; you need to manipulate those elements if you're going to change something.
As pointed out in the comments, you can easily use the DOM class in PHP.
You want to do this for quite a few reasons:
How?
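A minimal sketch of the DOM approach, not taken from the original answer. It assumes the feed is well-formed XML and reuses the element names from the question's sample; the `<feed>` wrapper, the sample text, and the helper name `mapContentDivs` are hypothetical. It collects the inner markup of the outer div inside every content element (the part the question wanted as $1) and hands it to a callback, so nested divs cannot end the match early.

```php
<?php
// Hypothetical sample input, modelled on the <entry> in the question.
$xml = <<<XML
<feed>
<entry>
<title>A general title of a post</title>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
It may <b>contain bold text</b> or <div>even divs</div>, and everything else.
</div>
</content>
</entry>
</feed>
XML;

/**
 * Hypothetical helper: calls $callback with the inner markup of the outer
 * <div> of every <content> element and returns the callback results.
 */
function mapContentDivs(string $xml, callable $callback): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $results = [];
    foreach ($doc->getElementsByTagName('content') as $content) {
        foreach ($content->childNodes as $child) {
            // Skip whitespace text nodes; only the outer <div> is of interest.
            if ($child instanceof DOMElement && $child->localName === 'div') {
                $inner = '';
                // Serialize each child node, so nested tags stay intact.
                foreach ($child->childNodes as $node) {
                    $inner .= $doc->saveXML($node);
                }
                $results[] = $callback(trim($inner));
                break; // only the first <div> directly under <content>
            }
        }
    }
    return $results;
}

// Example: strip the tags from each collected fragment and print the text.
foreach (mapContentDivs($xml, 'strip_tags') as $text) {
    echo $text, PHP_EOL;
}
```

Because the parser tracks nesting, the closing tag of the inner div is never mistaken for the end of the outer one.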
Use a DOM parser. Here's an example: http://htmlparsing.com/php.html
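As a rough illustration of what a DOM-based stand-in for preg_replace_callback() could look like (not code from the linked page), the sketch below rewrites the outer div in place. The helper name `rewriteContentDivs`, the `<feed>` wrapper, the sample text, and the prefix `x` are hypothetical. It assumes, as in the question's sample, that content has no namespace while the div is in the XHTML namespace, which is why the XPath query needs a registered prefix.

```php
<?php
// Hypothetical sample input, modelled on the <entry> in the question.
$xml = <<<XML
<feed>
<entry>
<title>A general title of a post</title>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
It may <b>contain bold text</b> or <div>even divs</div>, and everything else.
</div>
</content>
</entry>
</feed>
XML;

/**
 * Hypothetical helper: replaces the children of each outer <div> under
 * <content> with the text returned by $callback, then returns the document.
 */
function rewriteContentDivs(string $xml, callable $callback): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // Assumption: the div uses the XHTML namespace, as in the sample.
    $xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

    // Matches only divs that are direct children of <content>, not nested ones.
    foreach ($xpath->query('//content/x:div') as $div) {
        $newText = $callback(trim($div->textContent));
        // Drop the old children (nested divs included), then add the new text.
        while ($div->firstChild) {
            $div->removeChild($div->firstChild);
        }
        $div->appendChild($doc->createTextNode($newText));
    }
    return $doc->saveXML();
}

echo rewriteContentDivs($xml, 'strtoupper');
```

The callback here receives plain text (textContent); if the markup itself matters, serialize the child nodes with saveXML() before calling it.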