Select the first paragraph tag not contained within another tag using RegEx (Perl style)
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions not to use RegEx to parse HTML, but that doesn't really help my particular use case. I have a very controlled case where an internal application generates structured text, and I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text, and HTML was the simplest way I could think of to explain it. My actual case looks a little more like this (but with a lot more data, and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows across a ton of rows similar to those. An answer to my first question would help with the actual project.
Comments (5)
Your regex won't work. Even if you had only non-nested paragraphs, your capturing parentheses would match "First, non-nested ... Last paragraph." Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p. But an HTML parser would do a better job of that, imho.
How about something like this?
It does a look-ahead to make sure the match is not followed by a closing tag of the enclosing element, while still allowing it to sit at the end of the string. There is probably a better way to look for what is in the paragraph tags, but you need to avoid being too greedy (a .+? will not suffice).
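One way to read the look-ahead idea is sketched below. It is written in Python's re module rather than PHP's preg_match, and the helper name first_unnested_paragraph is made up for illustration. It assumes well-formed input in which div is the only wrapper element and divs are never nested inside each other; the same pattern may carry over to preg_match (with \z in place of \Z), but that has not been checked here.

```python
import re

# Assumption: well-formed markup, <div> is the only wrapper, divs are not nested.
UNNESTED_P = re.compile(
    # A paragraph whose body contains no other <p> or </p>.
    r"<p>((?:(?!</?p>).)*)</p>"
    # Look ahead: scanning forward, the next div tag (if any) must be an
    # opening <div>, or we must reach the end of the string. If the next
    # div tag is a closing </div>, the paragraph was inside a div.
    r"(?=(?:(?!</?div\b).)*(?:<div\b|\Z))",
    re.I | re.S,
)

def first_unnested_paragraph(html):
    """Return the body of the first <p> that is not inside a <div>, or None."""
    match = UNNESTED_P.search(html)
    return match.group(1) if match else None

if __name__ == "__main__":
    sample = (
        "<div>\n<p>First, nested paragraph</p>\n</div>\n"
        "<p>First, non-nested paragraph.</p>\n"
        "<p>Second paragraph.</p>\n<p>Last paragraph.</p>"
    )
    print(first_unnested_paragraph(sample))
```

The tempered dot, (?:(?!</?p>).)*, is what keeps the match from running past the first closing tag without relying on a lazy quantifier. The look-ahead rescans the rest of the input for every candidate, so this is only reasonable for small, controlled inputs.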
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested. Then get your result. Everything that is left is now not nested.
(This is Perl. I don't know how different it would look in PHP.)
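The strip-then-match steps can be sketched roughly as follows. This is Python rather than the Perl the answer refers to, the function name first_paragraph_after_stripping is hypothetical, and it assumes that div is the only wrapper element and that divs are not nested inside each other (a lazy match cannot pair nested open and close tags correctly).

```python
import re

def first_paragraph_after_stripping(html):
    """Remove every <div>...</div> block, then return the first <p> body left."""
    # Step 2 of the answer: drop everything that is nested.
    # Assumption: divs do not contain other divs.
    stripped = re.sub(r"<div\b[^>]*>.*?</div>", "", html, flags=re.I | re.S)
    # Step 3: whatever remains is top level, so a plain lazy match is enough.
    match = re.search(r"<p\b[^>]*>(.*?)</p>", stripped, flags=re.I | re.S)
    return match.group(1) if match else None

if __name__ == "__main__":
    sample = (
        "<div>\n<p>First, nested paragraph</p>\n</div>\n"
        "<p>First, non-nested paragraph.</p>\n"
        "<p>Second paragraph.</p>\n<p>Last paragraph.</p>"
    )
    print(first_paragraph_after_stripping(sample))
```

Splitting the work into two passes keeps each pattern simple, at the cost of losing the match's offset in the original string. If the text has to be replaced in place, the offset would need to be mapped back or the single-pattern approach used instead.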
"You shouldn't use regex to parse HTML."
That is what everybody says, but nobody really offers an example of how to actually do it; they just preach it. Well, thanks to some motivation from Levi Morrison, I decided to read into DomDocument and figure out how to do it.
To everybody who says "Oh, it is too hard to learn the parser, I'll just use regex": I had never done anything with DomDocument or XPath before, and this took me 10 minutes. Go read the docs on DomDocument (php.net/manual/en/class.domdocument.php) and parse HTML the way you're supposed to.
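For comparison, here is a minimal parser-based sketch. It uses Python's standard html.parser module as a stand-in for DomDocument and XPath, so it is not the PHP code this answer describes. The class and function names are hypothetical, and it assumes every start tag has a matching end tag (an unclosed void element such as a bare br would throw the depth counter off).

```python
from html.parser import HTMLParser

class FirstTopLevelParagraph(HTMLParser):
    """Collect the text of the first <p> that has no open ancestor element."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of currently open elements
        self.capturing = False  # True while inside the target paragraph
        self.done = False       # True once the target paragraph has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if not self.done and not self.capturing and tag == "p" and self.depth == 0:
            self.capturing = True
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        if self.capturing and tag == "p" and self.depth == 0:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.parts.append(data)

def first_paragraph_via_parser(html):
    parser = FirstTopLevelParagraph()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None

if __name__ == "__main__":
    sample = (
        "<div>\n<p>First, nested paragraph</p>\n</div>\n"
        "<p>First, non-nested paragraph.</p>\n<p>Second paragraph.</p>"
    )
    print(first_paragraph_via_parser(sample))
```

Tracking depth handles any wrapper element and any nesting level, which the regex sketches cannot do without extra assumptions.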
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and regular expressions describe regular languages), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser; it will get the job done considerably more smoothly than trying to hack together some regex.