How to get everything between two HTML tags? (with XPath?)

Posted on 2024-12-28 07:55:47

EDIT: I've added a solution that works in this case.


I want to extract a table from a page and I want to do this (probably) with a DOMDocument and XPath. But if you've got a better idea, tell me.

My first attempt was this (obviously faulty, because it will get the first closing table tag):

<?php 
    $tableStart = strpos($source, '<table class="schedule"');
    $tableEnd   = strpos($source, '</table>', $tableStart);
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));
?>

I thought this might be solvable with DOMDocument and/or XPath...


In the end I want everything between the tags (in this case, the <table> tags), and the tags themselves. So all the HTML, not just the values (e.g. not just 'Value' but 'Value' wrapped in its tags). And there is one catch...

  • The table contains other tables, so if you just search for the end of the table (the '</table>' tag) you will probably get the wrong tag.
  • The opening tag has a class with which you can identify it (classname = 'schedule').

Is this possible?

This is the (simplified) source fragment that I want to extract from another website (I also want to keep the HTML tags, not just the values, so the whole table with the class 'schedule'):

<table class="schedule">
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- a problematic tag...

    This could even be variable content. =O =S

</table>

Comments (4)

辞慾 2025-01-04 07:55:47

First of all, note that XPath is based on the XML Infoset, a model of XML in which there are no "start tags" or "end tags", only nodes.

Therefore, one shouldn't expect an XPath expression to select "tags"; it selects nodes.

Taking this fact into account, I interpret the question as:

I want to obtain the set of all elements that are between a given "start"
element and a given "end element", including the start and end elements.

In XPath 2.0 this can be done conveniently with the standard operator intersect.

In XPath 1.0 (which I assume you are using) this is not so easy. The solution is to use the Kayessian (by @Michael Kay) formula for node-set intersection:

The intersection of two node-sets: $ns1 and $ns2 is selected by evaluating the following XPath expression:

$ns1[count(.|$ns2) = count($ns2)]

Let's assume that we have the following XML document (as you never provided one):

<html>
    <body>
        <table>
            <tr valign="top">
                <td>
                    <table class="target">
                        <tr>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Starting Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Ending Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                        </tr>
                    </table>
                </td>
            </tr>
        </table>
    </body>
</html>

The start-element is selected by:

//table[@class = 'target']
         //td[. = 'Starting Node']

The end-element is selected by:

//table[@class = 'target']
         //td[. = 'Ending Node']

To obtain all wanted nodes we intersect the following two sets:

  1. The set consisting of the start element and all following elements (we name this $vFollowing).

  2. The set consisting of the end element and all preceding elements (we name this $vPreceding).

These are selected, respectively, by the following XPath expressions:

$vFollowing:

$vStartNode | $vStartNode/following::*

$vPreceding:

$vEndNode | $vEndNode/preceding::*

Now we can simply apply the Kayessian formula on the nodesets $vFollowing and $vPreceding:

       $vFollowing
          [count(.|$vPreceding)
          =
           count($vPreceding)
          ]

What remains is to substitute all variables with their respective expressions.

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vStartNode" select=
 "//table[@class = 'target']//td[. = 'Starting Node']"/>

 <xsl:variable name="vEndNode" select=
 "//table[@class = 'target']//td[. = 'Ending Node']"/>

 <xsl:variable name="vFollowing" select=
 "$vStartNode | $vStartNode/following::*"/>

 <xsl:variable name="vPreceding" select=
 "$vEndNode | $vEndNode/preceding::*"/>

 <xsl:template match="/">
      <xsl:copy-of select=
          "$vFollowing
              [count(.|$vPreceding)
              =
               count($vPreceding)
              ]"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the XML document above, the XPath expressions are evaluated and the wanted, correct node-set is output:

<td>Starting Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Ending Node</td>
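
Since the question is about PHP, the same intersection can probably be evaluated directly with DOMXPath, which implements XPath 1.0. The following is only a minimal sketch under a few assumptions: the sample document is shortened to the target table, and the variable names ($start, $end, $following, $preceding, $query) are illustrative and simply mirror the XSLT variables above.

```php
<?php
// Shortened version of the sample document used in this answer.
$xml = <<<XML
<table class="target">
  <tr>
    <td>Other Node</td>
    <td>Starting Node</td>
    <td>Inner Node</td>
    <td>Inner Node</td>
    <td>Ending Node</td>
    <td>Other Node</td>
  </tr>
</table>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);

// Expressions for the start and end elements (same as in the answer).
$start = "//table[@class = 'target']//td[. = 'Starting Node']";
$end   = "//table[@class = 'target']//td[. = 'Ending Node']";

// XPath 1.0 has no variables in DOMXPath, so the sets are substituted as text.
$following = "($start | $start/following::*)";
$preceding = "($end | $end/preceding::*)";

// Kayessian intersection: keep the nodes of $following that also belong to $preceding.
$query = "{$following}[count(. | $preceding) = count($preceding)]";

$texts = [];
foreach ($xp->query($query) as $node) {
    $texts[] = $node->textContent;
    echo $doc->saveXML($node), "\n"; // serializes the element together with its tags
}
```

Passing each node to saveXML() is what yields the markup rather than only the text values.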
尘曦 2025-01-04 07:55:47

Do not use regexes (or strpos...) to parse HTML!

Part of why this problem was difficult for you is that you are thinking in "tags" instead of "nodes" or "elements". Tags are an artifact of serialization. (HTML has optional end tags.) Nodes are the actual data structure. A DOMDocument has no "tags", only "nodes" arranged in the proper tree structure.

Here is how you get your table with XPath:

// This is a simple solution, but only works if the value of "class" attribute is exactly "schedule"
// $xpath = '//table[@class="schedule"]';

// This is what you want. It is equivalent to the "table.schedule" css selector:
$xpath = "//table[contains(concat(' ',normalize-space(@class),' '),' schedule ')]";

$d = new DOMDocument();
$d->loadHTMLFile('http://example.org');
$xp = new DOMXPath($d);
$tables = $xp->query($xpath);
foreach ($tables as $table) {
    // $table is the DOMElement of a table with class "schedule"; it includes all of its descendant nodes.
    echo $d->saveHTML($table); // print the table together with its own tags
}
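
To see that nested tables are not a problem for this approach, here is a small self-contained sketch. The sample markup and the variable names ($html, $doc, $outerHtml) are made up for illustration; only the XPath expression is taken from the answer. It assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument.

```php
<?php
// Made-up sample page: the wanted table has two classes and contains a nested table.
$html = '<html><body>'
      . '<table class="schedule big"><tr><td>'
      . '<table class="nested"><tr><td>inner</td></tr></table>'
      . '</td></tr></table>'
      . '<table class="other"><tr><td>x</td></tr></table>'
      . '</body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
// Matches "schedule" as one of several space-separated class names.
$query = "//table[contains(concat(' ', normalize-space(@class), ' '), ' schedule ')]";

$outerHtml = [];
foreach ($xp->query($query) as $table) {
    // Serializing the element node returns its own tags plus everything inside it.
    $outerHtml[] = $doc->saveHTML($table);
}

echo $outerHtml[0], "\n";
```

Because the parser has already built the tree, the nested closing tag never has to be told apart from the outer one.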
橘亓 2025-01-04 07:55:47

If you have well-formed HTML like this:

<html>
<body>
    <table>
        <tr valign='top'>
            <td>
                <table class='inner'>
                    <tr><td>Inner Table</td></tr>
                </table>
            </td>
            <td>
                <table class='second inner'>
                    <tr><td>Second  Inner</td></tr>
                </table>
            </td>
        </tr>
    </table>
</body>
</html>

Output the nodes (in an XML wrapper) with this PHP code:

<?php
    $xml = new DOMDocument();
    $strFileName = "t.xml";
    $xml->load($strFileName);

    $xmlCopy = new DOMDocument();
    $xmlCopy->loadXML( "<xml/>" ); 

    $xpath = new DOMXPath( $xml );
    $strXPath = "//table[@class='inner']";

    $elements = $xpath->query( $strXPath, $xml );
    foreach( $elements as $element ) {
        $ndTemp = $xmlCopy->importNode( $element, true );
        $xmlCopy->documentElement->appendChild( $ndTemp );
    }
    echo $xmlCopy->saveXML();
?>
樱花落人离去 2025-01-04 07:55:47

This gets the whole table, but it can be modified to grab another tag. It is quite a case-specific solution that can only be used under specific circumstances. It breaks if HTML, PHP or CSS comments contain the opening or closing tag. Use it with caution.

Function:

// **********************************************************************************
// Gets a whole html tag with its contents.
//  - Source should be a well formatted html string (get it with file_get_contents or cURL)
//  - You CAN provide a custom start tag containing e.g. an id or something else (<table style='border:0;')
//    This is recommended if it is not the only p/table/h2/etc. tag in the source.
//  - Skips a closing tag when another opening tag of the same kind precedes it.
function getTagWithContents($source, $tag, $customStartTag = false)
{

    $startTag = '<'.$tag;
    $endTag   = '</'.$tag.'>';

    $startTagLength = strlen($startTag);
    $endTagLength   = strlen($endTag);

//      ***************************** 
    if ($customStartTag)
        $gotStartTag = strpos($source, $customStartTag);
    else
        $gotStartTag = strpos($source, $startTag);

    // Can't find it?
    if ($gotStartTag === false) // strict check: position 0 is a valid match
        return false;       
    else
    {

//      ***************************** 

        // This is the hard part: finding the correct closing tag position.
        // <table class="schedule">
        //     <table>
        //     </table> <-- Not this one
        // </table> <-- But this one

        $foundIt          = false;
        $locationInScript = $gotStartTag;
        $startPosition    = $gotStartTag;

        // Checks whether another opening tag occurs before the next closing tag.
        while ($foundIt == false)
        {
            $gotAnotherStart = strpos($source, $startTag, $locationInScript + $startTagLength);
            $endPosition        = strpos($source, $endTag,   $locationInScript + $endTagLength);

            // If it can find another opening tag before the closing tag, skip that closing tag.
            if ($gotAnotherStart !== false && $gotAnotherStart < $endPosition)
            {               
                $locationInScript = $endPosition;
            }
            else
            {
                $foundIt  = true;
                $endPosition = $endPosition + $endTagLength;
            }
        }

//      ***************************** 

        // cut the piece from its source and return it.
        return substr($source, $startPosition, ($endPosition - $startPosition));

    } 
}

Application of the function:

$gotTable = getTagWithContents($tableData, 'table', '<table class="schedule"');
if (!$gotTable)
{
    $error = 'Failed to log in or to get the tag';
}
else
{
    //Do something you want to do with it, e.g. display it or clean it...
    // Non-greedy matches, so that only one attribute value is removed per match.
    $cleanTable = preg_replace('|href=\'(.*?)\'|', '', $gotTable);
    $cleanTable = preg_replace('|TITLE="(.*?)"|', '', $cleanTable);
}

Above you can find my final solution to my problem. Below is the old solution, from which I derived the general-purpose function.

Old solution:

// Try to find the table and remember its starting position. Check for success.
// No success means the user is not logged in.
$gotTableStart = strpos($source, '<table class="schedule"');
if ($gotTableStart === false)
{
    $err = 'Can\'t find the table start';
}
else
{

//      ***************************** 
    // This is the hard part: finding the closing tag.
    $foundIt          = false;
    $locationInScript = $gotTableStart;
    $tableStart       = $gotTableStart;

    while ($foundIt == false)
    {
        $innerTablePos = strpos($source, '<table', $locationInScript + 6);
        $tableEnd      = strpos($source, '</table>', $locationInScript + 7);

        // If it can find '<table' before '</table>' skip that closing tag.
        if ($innerTablePos !== false && $innerTablePos < $tableEnd)
        {               
            $locationInScript = $tableEnd;
        }
        else
        {
            $foundIt  = true;
            $tableEnd = $tableEnd + 8;
        }
    }

//      ***************************** 

    // Cut the table out of the same string the positions were computed in.
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));

} 
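
One limitation of the loop above is that it skips one closing tag per extra opening tag it sees, which seems to go wrong when tables are nested more than one level deep. A possible alternative, still string-based, is to keep a depth counter. The sketch below is not the code from this answer; extractBalancedTag is a hypothetical helper name, and it shares the same caveats (tags inside comments, or a ">" inside an attribute value, will confuse it).

```php
<?php
// Hypothetical helper (not part of the original answer): returns the element
// that starts at $startMarker, including nested elements with the same tag name,
// or false when the marker is missing or the tags are unbalanced.
function extractBalancedTag(string $source, string $tag, string $startMarker)
{
    $start = stripos($source, $startMarker);
    if ($start === false) {
        return false;
    }

    // Matches an opening "<tag ...>" or a closing "</tag>"; group 1 holds the slash.
    $pattern = '~<(/?)' . preg_quote($tag, '~') . '\b[^>]*>~i';
    $depth   = 0;
    $offset  = $start;

    while (preg_match($pattern, $source, $m, PREG_OFFSET_CAPTURE, $offset)) {
        // A closing tag lowers the depth, an opening tag raises it.
        $depth += ($m[1][0] === '/') ? -1 : 1;
        $offset = $m[0][1] + strlen($m[0][0]);

        if ($depth === 0) {
            // Back at the level of the start tag: this is its matching end tag.
            return substr($source, $start, $offset - $start);
        }
    }

    return false; // ran out of tags before the depth returned to zero
}

// Made-up sample with tables nested two levels deep.
$source = '<p>before</p>'
        . '<table class="schedule"><tr><td>'
        . '<table><tr><td><table><tr><td>deep</td></tr></table></td></tr></table>'
        . '</td></tr></table>'
        . '<p>after</p>';

$table = extractBalancedTag($source, 'table', '<table class="schedule"');
echo $table, "\n";
```

The counter only returns to zero at the closing tag that belongs to the start marker, so the nesting depth no longer matters. A DOM parser remains the more robust choice when the markup is not under your control.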