Regex to select text between anchors while ignoring all text inside any <>

Posted 2024-12-17 01:19:01

I have the following two types of text:

Type one:

<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table> 
</div>
</div>

Type two:

<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>

I'm using PHP with preg_match_all. I need a single expression that will return Officer One and Officer Two from the above. I'm using Corporate Officers</div> as the first anchor and </div> as the second, but I can't find Officer One inside all that table gibberish.

How do I return text between anchor1 and anchor2 while ignoring all text inside any brackets <> between?

I saw these threads but wasn't able to make their solutions work for me:
RegEx: extract everything until X where X is not between two braces

everything, but everything between [ and ]

Comments (3)

凡尘雨 2024-12-24 01:19:01

With SimpleXML:

$xml = new SimpleXMLElement('<div>
    <div class="meta-name">
        Corporate Officers
    </div>
    <div class="meta-data">
        <table border="0" cellspacing="0" cellpadding="0" width="171">
            <col width="171" />
            <tbody>
                <tr height="19">
                    <td width="171" height="19">
                        Officer One
                    </td>
                </tr>
            </tbody>
        </table>
    </div>
</div>
');

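// Collect the first non-empty text found in each meta-name / meta-data div.
$results = array();
foreach($xml->children() as $node) {
    if($node->getName() == 'div') {
        $attributes = $node->attributes();
        // Cast to string so explode() receives the attribute value, not the element object.
        $classes = explode(' ', (string) $attributes['class']);
        if(in_array('meta-name', $classes) || in_array('meta-data', $classes)) {
            $results[] = getText($node);
        }
    }
}

// Depth-first search: return the node's own trimmed text if it has any,
// otherwise the first non-empty text among its descendants.
function getText($node) {
    $text = trim((string) $node);
    if(strlen($text) !== 0) {
        return $text;
    }

    foreach($node->children() as $child) {
        if($text = getText($child)) {
            return $text;
        }
    }

    return null;
}

var_dump($results);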

As a general rule of thumb, never use Regex to parse HTML.
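
The question's real markup is not well-formed XML (note the stray closing </div>), so here is a minimal sketch of the same parser-based idea using DOMDocument and DOMXPath, which tolerate imperfect HTML. It assumes the PHP DOM extension is enabled; the helper name extractOfficers and the combined $html input are hypothetical and not part of the original answer.

```php
<?php
// Sample fragments copied from the question, joined into one input string.
$html = <<<'HTML'
<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>
HTML;

// Hypothetical helper: for every "Corporate Officers" label, return the
// trimmed text of the meta-data div that immediately follows it.
function extractOfficers(string $html): array
{
    $doc = new DOMDocument();

    // Collect parse warnings (stray </div>, </col>) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//div[@class="meta-name"][normalize-space()="Corporate Officers"]'
           . '/following-sibling::div[@class="meta-data"][1]';

    $names = [];
    foreach ($xpath->query($query) as $div) {
        // textContent concatenates all nested text, so the tags simply disappear.
        $text = trim($div->textContent);
        if ($text !== '') {
            $names[] = $text;
        }
    }
    return $names;
}

$names = extractOfficers($html);
print_r($names);
```

Because textContent ignores markup entirely, the same query handles both the table variant and the plain variant without any special casing.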

你げ笑在眉眼 2024-12-24 01:19:01

About 80% of regex questions are about XML/HTML/XHTML, and about 75% of the answers are "don't use a regex". Why? Because while a regex may seem to work for your example, it will be fragile and may break with a slight change in the input.

Please take a look at this beautiful tool. If you can't use it, come back and we will help.

や莫失莫忘 2024-12-24 01:19:01

Try this regex:

'~<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>(?:<(?!/?div\b)[^>]*>|\s+)*\K[^<]+~'

This is based on the assumption that there's no other text content in the HTML between the opening <div> tags and the names you're looking for. The first part is self-explanatory:

<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>

I'm assuming the "Corporate Officers" text is sufficient to locate the starting point, but you can reinsert the class attributes if necessary. After that,

(?:<(?!/?div\b)[^>]*>|\s+)*

...consumes any number of tags other than <div> or </div> tags, along with any intervening whitespace. Then \K comes along and says forget all that, the real match starts here. [^<]+ consumes everything up to the beginning of the next tag, and that's all you see in the match results. It's as if everything before the \K was really a positive lookbehind, but without all the restrictions.
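
As a minimal sketch of how the pattern above could be used with preg_match_all, the following runs it over the two sample fragments from the question. Joining both fragments into a single $html string is an assumption made for illustration; the variable names are hypothetical.

```php
<?php
// Sample fragments copied from the question, joined into one input string.
$html = <<<'HTML'
<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>
HTML;

// The pattern from the answer, unchanged.
$pattern = '~<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>(?:<(?!/?div\b)[^>]*>|\s+)*\K[^<]+~';

preg_match_all($pattern, $html, $matches);

// Because of \K, $matches[0] holds only the text after the skipped tags.
// trim() guards against trailing whitespace before the next tag.
$officers = array_map('trim', $matches[0]);

print_r($officers);
```

For this input, $officers contains "Officer One" and "Officer Two". No capture group is needed, since \K resets the start of the reported match.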

Here's a demo.
