How to get everything between two HTML tags? (with XPath?)
EDIT: I've added a solution which works in this case.
I want to extract a table from a page and I want to do this (probably) with a DOMDocument and XPath. But if you've got a better idea, tell me.
My first attempt was this (obviously faulty, because it will get the first closing table tag):
<?php
$tableStart = strpos($source, '<table class="schedule"');
$tableEnd = strpos($source, '</table>', $tableStart);
$rawTable = substr($source, $tableStart, ($tableEnd - $tableStart));
?>
I thought this might be solvable with a DOMDocument and/or XPath...
In the end I want everything between the tags (in this case, the <table> tags), and the tags themselves. So all the HTML, not just the values (e.g. not just 'Value', but 'Value' together with its enclosing tags). And there is one catch:
- The table has other tables in it. So if you just search for the end of the table (the '</table>' tag), you will probably get the wrong tag.
- The opening tag has a class with which you can identify it (classname = 'schedule').
Is this possible?
This is the (simplified) source piece that I want to extract from another website: (I also want to display the html tags, not just the values, so the whole table with the class 'schedule')
<table class="schedule">
<table class="annoying nested table">
Lots of table rows, etc.
</table> <-- The problematic tag...
<table class="annoying nested table">
Lots of table rows, etc.
</table> <-- The problematic tag...
<table class="annoying nested table">
Lots of table rows, etc.
</table> <-- a problematic tag...
This could even be variable content. =O =S
</table>
First of all, do note that XPath is based on the XML Infoset, a model of XML in which there are no "start tags" and "end tags", only nodes.
Therefore, one shouldn't expect an XPath expression to select "tags"; it selects nodes.
Taking this fact into account, I interpret the question as: select all nodes that lie between a given "start" element and a given "end" element, including the start and end elements themselves.
In XPath 2.0 this can be done conveniently with the standard operator intersect.
In XPath 1.0 (which I assume you are using) this is not so easy. The solution is to use the Kayessian (by @Michael Kay) formula for node-set intersection:
The intersection of two node-sets, $ns1 and $ns2, is selected by evaluating the following XPath expression:
$ns1[count(.|$ns2) = count($ns2)]
Let's assume that we have the following XML document (as you never provided one):
The start-element is selected by:
The end-element is selected by:
To obtain all wanted nodes we intersect the following two sets:
- The set consisting of the start element and all following elements (we name this $vFollowing).
- The set consisting of the end element and all preceding elements (we name this $vPreceding).
These are selected, respectively, by the following XPath expressions:
$vFollowing:
$vPreceding:
Now we can simply apply the Kayessian formula to the node-sets $vFollowing and $vPreceding:
$vFollowing[count(.|$vPreceding) = count($vPreceding)]
What remains is to substitute all variables with their respective expressions.
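The substitution can also be tried from PHP. The following is a minimal sketch, assuming a small hypothetical XML document; the element names `t`, `start`, `end`, `a` to `d` and the two path expressions are illustrative assumptions, not the ones from this answer. It builds the two node-sets as XPath strings and plugs them into the Kayessian formula using DOMXPath:

```php
<?php
// Hypothetical sample document; the element names are illustrative only.
$xml = <<<XML
<t>
  <a/>
  <start/>
  <b/>
  <c/>
  <end/>
  <d/>
</t>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// $vFollowing: the start element plus every element after it in document order.
$vFollowing = '(/*/start | /*/start/following::*)';
// $vPreceding: the end element plus every element before it in document order.
$vPreceding = '(/*/end | /*/end/preceding::*)';

// Kayessian formula: $ns1[count(.|$ns2) = count($ns2)]
// A node of $ns1 is kept only if adding it to $ns2 does not enlarge $ns2.
$expr = $vFollowing . '[count(. | ' . $vPreceding . ') = count(' . $vPreceding . ')]';

$names = [];
foreach ($xpath->query($expr) as $node) {
    $names[] = $node->nodeName;
}
echo implode(', ', $names), PHP_EOL;
```

For this sample the expression should select start, b, c and end. Note that the following and preceding axes skip descendants and ancestors respectively, so a deeply nested document may need different path expressions.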
XSLT-based verification:
When the transformation is applied to the XML document above, the XPath expressions are evaluated and the wanted, correct node-set is output:
Do not use regexes (or strpos...) to parse HTML!
Part of why this problem was difficult for you is that you are thinking in "tags" instead of "nodes" or "elements". Tags are an artifact of serialization. (HTML has optional end tags.) Nodes are the actual data structure. A DOMDocument has no "tags", only "nodes" arranged in the proper tree structure.
Here is how you get your table with XPath:
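A minimal sketch of this approach, assuming the page is already in a string: the sample markup, the variable names and the class-matching expression are assumptions made for illustration. The nested table is placed inside a cell so that the parser does not have to repair the markup.

```php
<?php
// Hypothetical input; in practice $source would hold the fetched page.
$source = <<<HTML
<html><body>
<table class="schedule">
  <tr><td>
    <table class="nested"><tr><td>inner</td></tr></table>
  </td></tr>
  <tr><td>outer</td></tr>
</table>
<p>after</p>
</body></html>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match "schedule" as one of possibly several space-separated class names.
$query = '//table[contains(concat(" ", normalize-space(@class), " "), " schedule ")]';
$table = $xpath->query($query)->item(0);

// saveHTML($node) serializes the node itself plus everything nested inside it.
$rawTable = $table !== null ? $doc->saveHTML($table) : '';
echo $rawTable, PHP_EOL;
```

Because the table is a single node in the tree, its nested tables come along as child nodes, and there is no closing tag to search for.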
If you have well-formed HTML like this:
output the nodes (in an XML wrapper) with this PHP code:
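A minimal sketch of one way to do this, assuming SimpleXML is available. The input string and the `result` wrapper element are hypothetical and not taken from this answer:

```php
<?php
// Hypothetical well-formed input; SimpleXML rejects markup that is not well-formed XML.
$source = '<div>'
    . '<table class="schedule"><tr><td>'
    . '<table class="nested"><tr><td>inner</td></tr></table>'
    . '</td></tr></table>'
    . '<p>after</p>'
    . '</div>';

$root = simplexml_load_string($source);
if ($root === false) {
    exit("Input is not well-formed XML\n");
}

// xpath() returns an array of matching elements (or a falsy value on failure).
$matches = $root->xpath('//table[@class="schedule"]') ?: [];

// Wrap the serialized matches in one root element so the output is well-formed too.
$out = '<result>';
foreach ($matches as $table) {
    $out .= $table->asXML();
}
$out .= '</result>';
echo $out, PHP_EOL;
```

This only works when the markup parses as XML; for ordinary HTML, a DOMDocument loaded with loadHTML is the more forgiving choice.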
This gets the whole table, but it can be modified to grab another tag. It is quite a case-specific solution which can only be used under specific circumstances. It breaks if HTML, PHP or CSS comments contain the opening or closing tag. Use it with caution.
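One way a string-based approach can cope with nested tables is to count opening and closing tags until the nesting depth returns to zero. The sketch below illustrates that idea under this assumption; it is not the poster's function, and `extractElement` and its parameters are hypothetical names.

```php
<?php
/**
 * Hypothetical helper: returns the element that starts at the first occurrence
 * of $openingTag, including nested elements of the same name, or null.
 * $openingTag must begin with "<$tagName", e.g. '<table class="schedule"'.
 * This is plain string scanning, so comments or scripts that contain the tag
 * will confuse it.
 */
function extractElement(string $source, string $openingTag, string $tagName): ?string
{
    $start = strpos($source, $openingTag);
    if ($start === false) {
        return null;
    }
    $open = '<' . $tagName;
    $close = '</' . $tagName . '>';
    $depth = 0;
    $pos = $start;
    while (true) {
        $nextOpen = stripos($source, $open, $pos);
        $nextClose = stripos($source, $close, $pos);
        if ($nextClose === false) {
            return null; // unbalanced markup: no closing tag left
        }
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            // Another opening tag comes first: go one level deeper.
            $depth++;
            $pos = $nextOpen + strlen($open);
        } else {
            // A closing tag comes first: go one level up.
            $depth--;
            $pos = $nextClose + strlen($close);
            if ($depth === 0) {
                return substr($source, $start, $pos - $start);
            }
        }
    }
}

$source = 'x<table class="schedule"><table class="a">r</table>'
    . '<table class="b">r</table>tail</table>y</table>';
$rawTable = extractElement($source, '<table class="schedule"', 'table');
echo $rawTable, PHP_EOL;
```

The same caveats apply as above: any occurrence of the tag text inside comments, scripts or attribute values throws the count off, which is why a DOM parser is usually the safer tool.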
Function:
Application of the function:
Above you can find my final solution to my problem. Below is the old solution, out of which I made a function for general use.
Old solution: