Using RegEx (Perl style) to select the first paragraph tag not contained in another tag

Posted on 2024-12-20 20:50:13

I have this block of HTML:

<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>

I'm trying to select the first non-nested paragraph in that block. I'm using PHP's (Perl-style) preg_match to find it, but I can't figure out how to ignore the p tag contained within the div.

This is what I have so far, but it selects the contents of the first, nested paragraph above.

/<p>(.+?)<\/p>/is

Thanks!

EDIT

Unfortunately, I don't have the luxury of a DOM Parser.

I completely appreciate the suggestions not to use RegEx to parse HTML, but that doesn't really help my particular use case. I have a very controlled case where an internal application generates structured text, and I'm trying to replace some of that text if it matches a certain pattern. This is a simplified case in which I'm trying to ignore text nested within other text, and HTML was the simplest way I could think of to explain it. My actual case looks a little more like this (but with a lot more data, and minified):

#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]

I have to reformat a certain column of certain rows across a ton of rows similar to these. An answer to my first question would help with the actual project.

段念尘 2024-12-27 20:50:13

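Your regex will not work. Even if you had only non-nested paragraphs, a greedy capture group (.+ rather than .+?) would match First, non-nested ... Last paragraph.

Try:

<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>

If \1 is p, grab \2.

But an HTML parser would do a better job, IMHO.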
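
As a rough illustration of the "if \1 is p, grab \2" step in PHP, the sketch below collects every top-level element and returns the body of the first one whose tag name is p. The function name firstTopLevelParagraph is hypothetical, and the pattern assumes tags without attributes and no same-named tags nested inside each other.

```php
<?php
// Sample input from the question.
$html = <<<HTML
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Group 1: tag name. Group 2: everything up to the matching close tag,
// skipping over any tags that do not start with that same name.
// Assumption: no attributes, and same-named tags never nest.
$pattern = '~<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>~';

// Hypothetical helper: returns the content of the first top-level <p>, or null.
function firstTopLevelParagraph(string $html, string $pattern): ?string
{
    // Each match is one top-level element; the <div> swallows its nested <p>.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if ($m[1] === 'p') {
            return $m[2];
        }
    }
    return null;
}

$first = firstTopLevelParagraph($html, $pattern);
echo $first, PHP_EOL;
```

Because the div is consumed as a single match, the paragraph inside it never shows up as a match of its own, which is what keeps it out of the result.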

妞丶爷亲个 2024-12-27 20:50:13

How about something like this?

<p>([^<>]+)<\/p>(?=(<[^\/]|$))

It uses a look-ahead to make sure the paragraph is not immediately followed by a closing tag (which would mean it is nested), while still allowing it to sit at the end of the string. There is probably a better way to find what is inside the paragraph tags, but you need to avoid being too greedy (a .+? will not suffice).
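
A minimal PHP sketch of this look-ahead approach follows. The first pattern is the one from the answer and expects minified input; the second, with \s* added, is an assumed variant for input that has whitespace between tags. Variable names are illustrative only.

```php
<?php
// Minified input, as the question describes (no whitespace between tags).
$minified = '<div><p>First, nested paragraph</p></div>'
          . '<p>First, non-nested paragraph.</p>'
          . '<p>Second paragraph.</p>'
          . '<p>Last paragraph.</p>';

// Pattern from the answer: a paragraph followed by an opening tag or end of input.
$strict = '~<p>([^<>]+)</p>(?=(<[^/]|$))~';

// Assumed variant: allow whitespace between </p> and whatever comes next.
$lenient = '~<p>([^<>]+)</p>(?=\s*(?:<[^/]|$))~';

preg_match($strict, $minified, $m);
$fromMinified = $m[1] ?? null;

// The same markup with line breaks and indentation.
$pretty = "<div>\n  <p>First, nested paragraph</p>\n</div>\n"
        . "<p>First, non-nested paragraph.</p>\n<p>Second paragraph.</p>\n";

preg_match($lenient, $pretty, $m2);
$fromPretty = $m2[1] ?? null;

echo $fromMinified, PHP_EOL, $fromPretty, PHP_EOL;
```

One limitation to keep in mind: a nested paragraph that is followed by another opening tag inside the same container, such as `<div><p>a</p><p>b</p></div>`, still passes the look-ahead, so this only fits data where each container holds a single paragraph.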

以可爱出名 2024-12-27 20:50:13

Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.

s{<div>.*?</div>}{}gs;        # HTML example (/s lets . match newlines)
s/#.*?#//g;                   # 2nd example

Third, get your result. Everything that is left is now not nested.

($result) = m{<p>(.*?)</p>};  # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]};     # 2nd example

(This is Perl. I don't know how different it would look in PHP.)
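
For reference, the same strip-then-match idea might look roughly like this in PHP, using preg_replace and preg_match. This is only a sketch: it assumes well-formed input, div containers that do not nest, and no # characters inside the records themselves.

```php
<?php
$html = "<div>\n  <p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n<p>Second paragraph.</p>\n";

// Step 1: drop the nested container. The s flag lets . span newlines.
$flat = preg_replace('~<div>.*?</div>~s', '', $html);

// Step 2: whatever remains is top level, so the first <p> is the one we want.
$firstParagraph = preg_match('~<p>(.*?)</p>~s', $flat, $m) ? $m[1] : null;

// Same idea for the structured-text sample from the question:
// strip the #...# spans, then take the first [...] record.
$data = "#[BILLINGCODE|12345|11|15|2001|15|26|50]#\n"
      . "[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]\n"
      . "#[{{Additional Details }}]#\n"
      . "[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]\n";

$stripped = preg_replace('~#.*?#~', '', $data);
$firstRecord = preg_match('~\[(.*?)\]~', $stripped, $r) ? $r[1] : null;

echo $firstParagraph, PHP_EOL, $firstRecord, PHP_EOL;
```

Doing the removal as a separate pass keeps each pattern simple, at the cost of copying the input once.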

星 2024-12-27 20:50:13

"You shouldn't use regex to parse HTML."

It is what everybody says, but nobody really offers an example of how to actually do it; they just preach it. Well, thanks to some motivation from Levi Morrison, I decided to read up on DomDocument and figure out how to do it.

To everybody who says, "Oh, it is too hard to learn the parser, I'll just use regex": I had never done anything with DomDocument or XPath before, and this took me 10 minutes. Go read the docs on DomDocument (php.net/manual/en/class.domdocument.php) and parse HTML the way you're supposed to.

$myHtml = <<<MARKUP
   <html>
       <head>
            <title>something</title></head>
       <body>
            <div>
                <p>not valid</p>
            </div>
            <p>is valid</p>
            <p>is not valid</p>
            <p>is not valid either</p>
            <div>
                <p>definitely not valid</p>
            </div>
       </body>
   </html>
MARKUP;

$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
// Select only the <p> elements that are direct children of <body>.
$nodeList = $DomXPath->query('/html/body/p');
// item(0) is the first non-nested paragraph.
$yourNode = $DomDocument->saveHTML($nodeList->item(0));

var_dump($yourNode);

// output '<p>is valid</p>'
