Screen scraping with PHP and XPath

Posted 2024-07-12 01:59:36

Does anyone know how to maintain text formatting when using XPath to extract data?

I am currently extracting all blocks


<div class="info">
<h5>title</h5>
text <a href="somelink">anchor</a>
</div>

from a page. The problem is that when I access nodeValue, I only get plain text. How can I capture the contents including the markup, i.e. with the h5 and a tags still in place?

Thanks in advance. I have searched Google with every combination I can think of, with no luck.

Comments (5)

新雨望断虹 2024-07-19 01:59:36

If you have the node as a DOMElement $element that belongs to a DOMDocument $dom, then you will want to do something like:

$string = $dom->saveXML($element);

The nodeValue of an element is really the textual value, not the structured XML.
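A minimal end-to-end sketch of this approach, assuming the page has already been fetched into a string (the $html variable and its contents are made up for illustration). It loads the markup with DOMDocument, selects the blocks with DOMXPath, and serializes each match with saveXML():

```php
<?php
// Illustrative input; in real use this string would come from the scraped page.
$html = '<html><body>'
      . '<div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div>'
      . '</body></html>';

$dom = new DOMDocument();
// loadHTML() tolerates real-world HTML; parser warnings are collected instead of printed.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$blocks = [];
foreach ($xpath->query('//div[@class="info"]') as $element) {
    // $element->nodeValue would be "title text anchor"; saveXML() keeps the tags.
    $blocks[] = $dom->saveXML($element);
}

echo $blocks[0], PHP_EOL;
```

If you would rather get HTML serialization than XML, $dom->saveHTML($element) can be used the same way (passing a node requires PHP 5.3.6 or later).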

黯淡〆 2024-07-19 01:59:36

I would like to add to Ciaran McNulty's answer.

You can do the same in SimpleXML:

$simplexml->node->asXML(); // saveXML() is an alias of asXML()
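For completeness, a small self-contained sketch of the SimpleXML route. The input string is made up for illustration, and note that SimpleXML needs well-formed XML, so this assumes the scraped markup is already valid:

```php
<?php
// Illustrative, well-formed input; SimpleXML cannot parse tag soup.
$xml = '<body><div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div></body>';

$root = simplexml_load_string($xml);

// xpath() returns an array of SimpleXMLElement objects.
$matches = $root->xpath('//div[@class="info"]');

// Casting to string yields only the text directly inside the element;
// asXML() returns the element with its markup intact.
$plain  = (string) $matches[0];
$markup = $matches[0]->asXML();

echo $markup, PHP_EOL;
```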

And to expand on the quote:

The nodeValue of an element is really the textual value, not the structured XML.

You can think of your node as follows:

<div class="info">
    <__toString()> </__toString()>
    <h5>title</h5>
    <__toString()> text </__toString()>
    <a href="somelink">anchor</a>
    <__toString()> </__toString()>
</div>

Calling $element->nodeValue is like calling $element->__toString(): it only collects the __toString() pieces (including the ones nested inside h5 and a, which the sketch above leaves out). The imaginary __toString() element I made up is officially defined as an XML_TEXT_NODE.
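To see those node types for real, here is a short sketch that walks the children of the div and reports what each one is. The input is the block from the question, written on one line so that no whitespace-only text nodes get in the way:

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div>');

$div = $dom->documentElement;

$kinds = [];
foreach ($div->childNodes as $child) {
    // XML_TEXT_NODE marks bare text; elements report their tag name instead.
    $kinds[] = $child->nodeType === XML_TEXT_NODE ? 'text' : $child->nodeName;
}

echo implode(', ', $kinds), PHP_EOL;   // h5, text, a
echo $div->nodeValue, PHP_EOL;         // title text anchor
```

The second line of output shows that nodeValue on an element also pulls in the text nested inside the h5 and a children, just without their tags.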

如此安好 2024-07-19 01:59:36

The XPath language is designed to be embedded in a host language or API (such as the DOM API, XSLT, XQuery, ...) and cannot be used standalone. The original question does not specify the desired embedding.

Below is a very simple and short solution when XPath is embedded in XSLT.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>

    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>

when applied to this XML document:

<html>
    <body>
        <div class="info">
            <h1>title1</h1> text1
            <a href="somelink1">anchor1</a>
        </div>
        Something else here
        <div class="info">
            <h2>title2</h2> text2
            <a href="somelink2">anchor2</a>
        </div>
        Something else here
        <div class="info">
            <h3>title3</h3> text3
            <a href="somelink3">anchor3</a>
        </div>
    </body>
</html>

produces the wanted result:

<div class="info">
  <h1>title1</h1> text1
    <a href="somelink1">anchor1</a>
</div>
        Something else here
<div class="info">
  <h2>title2</h2> text2
  <a href="somelink2">anchor2</a>
</div>
        Something else here
<div class="info">
  <h3>title3</h3> text3
  <a href="somelink3">anchor3</a>
</div>
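
Since the question is about PHP, here is a rough sketch of running that same stylesheet from PHP. It assumes the xsl extension is enabled (XSLTProcessor is not available otherwise) and uses a shortened copy of the sample document; the variable names are only illustrative:

```php
<?php
// Assumes ext-xsl is enabled; the stylesheet is the one shown above.
$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>
    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
XSL;

// Shortened version of the sample document, with a single info block.
$xmlSource = <<<'XML'
<html>
    <body>
        <div class="info">
            <h1>title1</h1> text1
            <a href="somelink1">anchor1</a>
        </div>
        Something else here
    </body>
</html>
XML;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$doc = new DOMDocument();
$doc->loadXML($xmlSource);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

// transformToXml() returns the result as a string.
$result = $proc->transformToXml($doc);
echo $result;
```

Text outside the matched div elements (such as "Something else here") still shows up in the result, because XSLT's built-in template rules copy text nodes through.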

久隐师 2024-07-19 01:59:36

You'll need to make sure your XPath query 'ends' at the <div class="info"> element. However, because of the way XPath works, you'll still get all of the child tags as separate nodes; you'll just need to concatenate them.

You could also use XPath's join functionality (string-join() in XPath 2.0), though as I haven't used it, I can't say what problems you might run into.
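
The concatenation idea could look roughly like this in PHP. innerXml() is a hypothetical helper name, not a built-in, and the input is the block from the question. As far as I know, PHP's DOMXPath only implements XPath 1.0, which has no string-join(), so the joining is done on the PHP side here:

```php
<?php
/**
 * Hypothetical helper: serialize every child of $node and glue the pieces
 * together, giving the node's contents without its own start and end tags.
 */
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadXML('<div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div>');

$xpath = new DOMXPath($dom);
// The query ends at the div itself; its children are walked in innerXml().
$div = $xpath->query('//div[@class="info"]')->item(0);

echo innerXml($div), PHP_EOL;   // <h5>title</h5> text <a href="somelink">anchor</a>
```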

娇柔作态 2024-07-19 01:59:36

div/node() should do the trick.

Example input:

<div class="info">
  some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>

Example XSLT stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
        <newtag>
                <xsl:copy-of select="div/node()"/>
        </newtag>
</xsl:template>

</xsl:stylesheet>

Example output:

<?xml version="1.0" encoding="utf-8"?>
<newtag> some<h5>title</h5> text <a href="somelink">anchor</a> more text</newtag>
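
The same node() test also works directly from PHP's DOMXPath, without going through XSLT. A small sketch, using the example input above on a single line (the variable names are only illustrative):

```php
<?php
$dom = new DOMDocument();
$dom->loadXML(
    '<div class="info">some <h5>title</h5> text <a href="somelink">anchor</a> more text</div>'
);

$xpath = new DOMXPath($dom);

// node() matches text nodes as well as elements; * would match elements only.
$nodes = $xpath->query('/div/node()');
$elementsOnly = $xpath->query('/div/*')->length;

$contents = '';
foreach ($nodes as $node) {
    // Serialize each selected node and append it in document order.
    $contents .= $dom->saveXML($node);
}

echo '<newtag>', $contents, '</newtag>', PHP_EOL;
```

Using div/* instead would silently drop the loose text between the tags, which is why node() is the better fit here.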