How to create a pattern for preg_match_all

Posted on 2024-12-14 22:01:56

I tried googling this but couldn't find anything clear about it. First, I was hoping someone could help me write a pattern to get the content between these tags:

<vboxview leftinset="10" rightinset="0" stretchiness="1">    // CONTENT INSIDE HERE </vboxview>

Second, could you please explain each part of the pattern in detail: what it does, and how you specify which part of the source to capture?

尘曦 2024-12-21 22:01:56

See my comment on the question for my rant on SGML-based languages and regex...

Now to my answer.

If you know there will not be any other HTML/XML elements inside the tag in question, then this will work quite well:

<vboxview\s+(?P<vboxviewAttributes>(\\>|[^>])*)>(?P<vboxviewContent>(\\<|[^<])*)</vboxview>

Broken down, this expression says:

<vboxview                  # match `<vboxview` literally
\s+                        # match at least one whitespace character
(?P<vboxviewAttributes>    # begin capture (into a group named "vboxviewAttributes")
   (\\>|[^>])*             #    any number of (either `\>` or NOT `>`)
)                          # end capture
>                          # match a `>` character
(?P<vboxviewContent>       # begin capture (into a group named "vboxviewContent")
   (\\<|[^<])*             #    any number of (either `\<` or NOT `<`)
)                          # end capture
</vboxview>                # match `</vboxview>` literally

You will need to escape any < and > characters inside the source as \< and \>, or even better as HTML/XML entities (&lt; and &gt;).

If there are going to be nested constructs inside, then either you will start running into problems with regex, or you will already have decided to use another method that does not involve regex. Either way is fine!
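
For completeness, here is a minimal sketch of how the pattern above might be passed to preg_match_all. The sample input in $source is hypothetical, the inner groups are written as non-capturing (?:...) to keep the result array small, and ~ is used as the delimiter so that the / in </vboxview> needs no escaping. The pattern is written as a nowdoc because inside an ordinary quoted PHP string \\> would collapse to \>, which PCRE reads as a plain >.

```php
<?php
// Hypothetical sample input: two flat vboxview elements with no nested tags.
$source = '<vboxview leftinset="10" rightinset="0" stretchiness="1"> first </vboxview>'
        . '<vboxview leftinset="5" rightinset="0" stretchiness="1"> second </vboxview>';

// Nowdoc: no escape processing, so the backslashes reach PCRE unchanged.
$pattern = <<<'REGEX'
~<vboxview\s+(?P<vboxviewAttributes>(?:\\>|[^>])*)>(?P<vboxviewContent>(?:\\<|[^<])*)</vboxview>~
REGEX;

// PREG_SET_ORDER groups the results per match rather than per capture group.
preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // Named groups are available under their names as string keys.
    echo $m['vboxviewAttributes'], ' => ', trim($m['vboxviewContent']), PHP_EOL;
}
```

With PREG_SET_ORDER, each element of $matches is one complete match, so the attributes and content of the same tag stay together.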

失退 2024-12-21 22:01:56

As mentioned in the comments, it is usually not a good idea to try to extract things from HTML with regular expressions. If you ever want to switch to a more robust method, here's a quick example of how you could extract the information using the DOMDocument API.

<?php
function get_vboxview($html) {

    $output = array();

    // Create a new DOM object
    $doc = new DOMDocument;

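    // load a string in as html; vboxview is not a standard HTML tag, so
    // collect libxml's parser warnings instead of emitting them
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();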

    // create a new Xpath object to query the document with
    $xpath = new DOMXPath($doc);

    // an xpath query that looks for a vboxview node anywhere in the DOM
    // with an attribute named leftinset set to 10, an attribute named rightinset
    // set to 0 and an attribute named stretchiness set to 1
    $query = '//vboxview[@leftinset=10 and @rightinset=0 and @stretchiness=1]';

    // query the document
    $matches = $xpath->query($query);

    // loop through each matching node
    // and add its textContent to the output
    foreach ($matches as $m) {
        $output[] = $m->textContent;
    }

    return $output;
}
?>

Better yet, if there is guaranteed to be only one vboxview in your input (and assuming you control the HTML), you could add an id attribute to the vboxview and cut the code down to a shorter, more general function.

<?php
function get_node_text($html, $id) {
    // Create a new DOM object
    $doc = new DOMDocument;

    // load a string in as html
    $doc->loadHTML($html);

    // return the textContent of the node with the id $id,
    // or null if no such node exists
    $node = $doc->getElementById($id);
    return $node ? $node->textContent : null;
}
?>
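
As a rough illustration of why the DOM route copes with nested markup, the sketch below runs a hypothetical input with a nested <b> element through DOMDocument. It uses getElementsByTagName rather than the XPath query above, purely to keep the example short, and it silences libxml's warnings about the non-standard vboxview tag.

```php
<?php
// Hypothetical input: the vboxview now contains a nested element.
$html = '<vboxview leftinset="10" rightinset="0" stretchiness="1">'
      . 'Hello <b>nested</b> world</vboxview>';

// vboxview is not a standard HTML tag; collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

// textContent concatenates the text of the node and all of its descendants.
$texts = array();
foreach ($doc->getElementsByTagName('vboxview') as $node) {
    $texts[] = $node->textContent;
}

print_r($texts);
```

The regex from the first answer would not match this input, because its content group stops at the first < it meets.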
