Screen scraping with PHP and XPath
Does anyone know how to maintain text formatting when using XPath to extract data?
I am currently extracting all blocks
<div class="info">
<h5>title</h5>
text <a href="somelink">anchor</a>
</div>
from a page. The problem is that when I access nodeValue, I only get plain text. How can I capture the contents including the formatting, i.e. with the h5 and a tags still in the markup?
Thanks in advance. I have searched every combination imaginable on Google and no luck.
Comments (5)
If you have it as a DOMElement $element that is part of a DOMDocument $dom, then you will want to do something like:
The nodeValue of an element is only its textual value, not the structured XML.
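As a minimal sketch of the difference described above, the following compares nodeValue with serializing the node through its owning document. It assumes the sample markup from the question and uses `DOMDocument::saveXML()` with a node argument; the variable names are illustrative and this may not be the exact snippet the answer originally showed.

```php
<?php
// Sample markup from the question, kept on one line so whitespace stays predictable.
$html = '<div class="info"><h5>title</h5>text <a href="somelink">anchor</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Locate the block with XPath (assumed query, matching the question's class name).
$xpath = new DOMXPath($dom);
$element = $xpath->query('//div[@class="info"]')->item(0);

// nodeValue flattens the element and its descendants to plain text.
$text = $element->nodeValue;

// Passing the node to saveXML() serializes it with its tags and attributes.
$markup = $dom->saveXML($element);

echo $text, PHP_EOL;
echo $markup, PHP_EOL;
```

Note that this returns the outer markup, including the enclosing div itself.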
I would like to add to Ciaran McNulty's answer.
You can do the same in SimpleXML like:
And to expand on the quote:
You can think of your node as follows:
The call to $element->nodeValue is like calling $element->__toString(), which would only get the __toString() elements. The imaginary __toString() I created is officially defined as an XML_TEXT_NODE.
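A rough SimpleXML sketch of the same idea, assuming the question's block sits inside a hypothetical `<root>` wrapper: casting an element to a string yields only its own text, while `asXML()` returns the element with its markup. The wrapper and variable names are assumptions made for illustration.

```php
<?php
// The <root> wrapper is hypothetical; the inner block is the question's sample.
$xml = '<root><div class="info"><h5>title</h5>text <a href="somelink">anchor</a></div></root>';

$root = simplexml_load_string($xml);
$div = $root->div;

// String cast: only the text that sits directly inside <div>, no child tags.
$text = (string) $div;

// asXML() without a filename returns the element serialized as a string.
$markup = $div->asXML();

echo $text, PHP_EOL;
echo $markup, PHP_EOL;
```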
The XPath language is designed to be embedded in a host language (such as the DOM API, XSLT, XQuery, ...) and cannot be used standalone. The original question does not specify which embedding is wanted.
Below is a very simple and short solution for when XPath is embedded in XSLT.
This transformation:
when applied to this XML document:
produces the wanted result:
You'll need to make sure your XPath query 'ends' at the <div class="info"> element. However, because of the way XPath works, you'll still get all of the 'subtags' in separate nodes. You'll just need to concatenate them.
You could also use XPath's join functionality, though, as I haven't used it, I can't say what problems you might run into.
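The concatenation step could be sketched as follows, assuming PHP's DOM extension and the question's sample markup. `innerMarkup()` is a hypothetical helper, not a built-in; it serializes each child node with `DOMDocument::saveHTML()` and joins the pieces.

```php
<?php
// Hypothetical helper: joins the serialized child nodes of $node,
// producing the node's contents without the enclosing tag.
function innerMarkup(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node.
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$html = '<div class="info"><h5>title</h5>text <a href="somelink">anchor</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The query stops at the div; its children are concatenated afterwards.
$blocks = [];
foreach ($xpath->query('//div[@class="info"]') as $div) {
    $blocks[] = innerMarkup($div);
}

echo $blocks[0], PHP_EOL;
```

Unlike serializing the div itself, this keeps only the h5, the text, and the anchor.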
div/node() should do the trick.
Example input:
Example XSLT stylesheet:
Example output:
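One way to try the `div/node()` idea from PHP is sketched below. It is not the answer's original stylesheet: the `<page>` wrapper, the template, and the variable names are assumptions, and it needs PHP's optional xsl extension, so the code checks for it before running the transformation.

```php
<?php
// Hypothetical input: the question's block inside an assumed <page> wrapper.
$xml = '<page><div class="info"><h5>title</h5>text <a href="somelink">anchor</a></div></page>';

// Assumed stylesheet: copy every child node of the matching div to the output.
$xslt = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <xsl:copy-of select="//div[@class='info']/node()"/>
  </xsl:template>
</xsl:stylesheet>
XSL;

$result = null;
if (extension_loaded('xsl')) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $sheet = new DOMDocument();
    $sheet->loadXML($xslt);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($sheet);

    // transformToXml() returns the transformation result as a string.
    $result = $proc->transformToXml($doc);
    echo $result, PHP_EOL;
} else {
    echo 'The xsl extension is not loaded; skipping the transformation.', PHP_EOL;
}
```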