Screen scraping with PHP and XPath

Posted on 2024-07-12 01:59:36

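Does anyone know how to preserve text formatting when extracting data with XPath?

I am currently extracting all blocks of the form

<div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div>

from a page. The problem is that when I access nodeValue, I only get plain text. How can I capture the contents including the formatting, i.e. with the h5 and a tags still in the markup?

Thanks in advance. I have searched Google for every combination I could think of, with no luck.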

新雨望断虹 2024-07-19 01:59:36

If you have it as a DOMElement $element that is part of a DOMDocument $dom, then you will want to do something like:

$string = $dom->saveXML($element);

The nodeValue of an element is really its textual value, not the structured XML.
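
To make the difference concrete, here is a minimal sketch that loads a small page, selects the block with DOMXPath, and compares nodeValue with saveXML(). The sample markup and the variable names are assumptions made for illustration; they are not taken from the asker's page.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = '<html><body><div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$element = $xpath->query('//div[@class="info"]')->item(0);

// nodeValue flattens the element to its text content; the tags are gone.
$plain = $element->nodeValue;

// saveXML() with a node argument serializes just that node, tags included.
$markup = $dom->saveXML($element);

echo $plain, "\n";
echo $markup, "\n";
```

If the scraped page is HTML rather than XML, $dom->saveHTML($element) may give more natural output, since it serializes with HTML rules.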

黯淡〆 2024-07-19 01:59:36

I would like to add to Ciaran McNulty's answer.

You can do the same in SimpleXML like:

$simplexml->node->asXML(); // saveXML() is an alias of asXML()

And to expand on the quote

The nodeValue of an element is really its textual value, not the structured XML.

You can think of your node as follows:

<div class="info">
    <__toString()> </__toString()>
    <h5>title</h5>
    <__toString()> text </__toString()>
    <a href="somelink">anchor</a>
    <__toString()> </__toString()>
</div>

Reading $element->nodeValue is like calling $element->__toString(): you get text only, never the tags. (In SimpleXML, __toString() returns just the text directly inside the element, as pictured above; DOM's nodeValue also includes the text of descendant elements such as h5 and a.) The imaginary __toString() element I made up is officially defined as an XML_TEXT_NODE.
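
The three views of the same node can be compared side by side. This is only a sketch: the <root> wrapper and the variable names are assumptions added so the snippet is self-contained.

```php
<?php
// Hypothetical sample document; the <root> wrapper is only for illustration.
$xml = '<root><div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div></root>';

$simplexml = simplexml_load_string($xml);
$div = $simplexml->div;

// asXML() returns the element serialized with its tags.
$markup = $div->asXML();

// Casting to string calls __toString(), which returns only the text
// directly inside <div>, skipping the text inside <h5> and <a>.
$directText = (string) $div;

// The DOM view of the same node: nodeValue includes descendant text too.
$allText = dom_import_simplexml($div)->nodeValue;

echo $markup, "\n";
echo trim($directText), "\n";
echo trim($allText), "\n";
```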

如此安好 2024-07-19 01:59:36

The XPath language is designed to be embedded in a host language (such as the DOM API, XSLT, XQuery, ...) and cannot be used standalone. The original question does not specify which embedding is wanted.

Below is a very simple and short solution for the case where XPath is embedded in XSLT.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>

    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>

when applied to this XML document:

<html>
    <body>
        <div class="info">
            <h1>title1</h1> text1
            <a href="somelink1">anchor1</a>
        </div>
        Something else here
        <div class="info">
            <h2>title2</h2> text2
            <a href="somelink2">anchor2</a>
        </div>
        Something else here
        <div class="info">
            <h3>title3</h3> text3
            <a href="somelink3">anchor3</a>
        </div>
    </body>
</html>

produces the wanted result:

<div class="info">
  <h1>title1</h1> text1
    <a href="somelink1">anchor1</a>
</div>
        Something else here
<div class="info">
  <h2>title2</h2> text2
  <a href="somelink2">anchor2</a>
</div>
        Something else here
<div class="info">
  <h3>title3</h3> text3
  <a href="somelink3">anchor3</a>
</div>

The "Something else here" lines appear because XSLT's built-in template rules copy text nodes that no explicit template matches.
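
Since the question is about PHP, the same stylesheet can be run from PHP with XSLTProcessor. The sketch below assumes the xsl extension is enabled, and it uses a shortened, hypothetical input with a single matching block instead of the full document above.

```php
<?php
// Assumes PHP's xsl extension is enabled (it provides XSLTProcessor).
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>
    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
XSL;

// Shortened, hypothetical input with one matching block.
$xml = '<html><body><div class="info"><h1>title1</h1> text1 <a href="somelink1">anchor1</a></div></body></html>';

$stylesheet = new DOMDocument();
$stylesheet->loadXML($xsl);

$source = new DOMDocument();
$source->loadXML($xml);

$processor = new XSLTProcessor();
$processor->importStylesheet($stylesheet);

// transformToXml() returns the transformation result as a string.
$result = $processor->transformToXml($source);
echo $result;
```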

久隐师 2024-07-19 01:59:36

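You need to make sure that your XPath query "ends" at the div element. However, because of the way XPath works, you will still get all the "child tags" as separate nodes. You just need to concatenate them.

You may also be able to use XPath's join functionality, but since I have not used it, I can't say what problems you might run into.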
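
One way to do the concatenation in PHP's DOM API is to serialize each child node and join the pieces. The helper name innerXml() is hypothetical, as is the sample markup; this is a sketch of the idea rather than the answerer's own code.

```php
<?php
// Hypothetical helper: serialize each child node and join the pieces,
// giving the element's inner markup without the outer tag.
function innerXml(DOMNode $node): string
{
    $parts = [];
    foreach ($node->childNodes as $child) {
        $parts[] = $node->ownerDocument->saveXML($child);
    }
    return implode('', $parts);
}

$dom = new DOMDocument();
$dom->loadXML('<div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div>');

$inner = innerXml($dom->documentElement);
echo $inner, "\n";
```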

娇柔作态 2024-07-19 01:59:36

div/node() should do the trick.

Example input:

<div class="info">
  some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>

Example XSLT stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
        <newtag>
                <xsl:copy-of select="div/node()"/>
        </newtag>
</xsl:template>

</xsl:stylesheet>

Example output:

<?xml version="1.0" encoding="utf-8"?>
<newtag> some<h5>title</h5> text <a href="somelink">anchor</a> more text</newtag>
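
The node() test also works from PHP's DOMXPath, where it returns text nodes as well as elements. The sketch below reuses the example input; wrapping the result in a newtag string by hand is an assumption made to mirror the stylesheet's output, not something the DOM API does for you.

```php
<?php
$dom = new DOMDocument();
// Example input from the answer above.
$dom->loadXML('<div class="info">
  some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>');

$xpath = new DOMXPath($dom);

// node() matches every child: text nodes as well as elements.
$children = $xpath->query('/div/node()');

$pieces = '';
foreach ($children as $child) {
    $pieces .= $dom->saveXML($child);
}

// Wrap the serialized children by hand to mirror the <newtag> output.
$wrapped = '<newtag>' . $pieces . '</newtag>';
echo $wrapped, "\n";
```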

About the author

ゞ花落谁相伴
