XPath node to string

Posted on 2024-09-12 17:19:20

How can I select the string contents of the following nodes:

<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>

I have tried a few things:

//span/text()

doesn't return the text inside the bold tag

//span/string(.)

is invalid

string(//span)

only selects the first node

I am using SimpleXML in PHP, and I think the only other option is to use //span, which returns:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => url
                )

            [b] => test
        )

    [1] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => url
                )

            [b] => test2
        )

)

*Note that it also drops the "more words" text from the second span.

So I guess I could then flatten each item in the array using PHP somehow? XPath is preferred, but any other ideas would help too.

Comments (7)

芯好空 2024-09-19 17:19:20
$xml = '<foo>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); // or load an HTML document with loadHTML()
$x = new DOMXPath($dom);
// textContent concatenates all descendant text nodes of each matched span
foreach ($x->query("//span[@class='url']") as $node) {
    echo $node->textContent;
}
红玫瑰 2024-09-19 17:19:20

You don't even need XPath for this:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
    if(in_array('url', explode(' ', $span->getAttribute('class')))) {
        $span->nodeValue = $span->textContent;
    }
}
echo $dom->saveHTML();

EDIT after comment below

If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood that you wanted one string for the span instead of the nested structure. In that case, you should also consider whether simply running strip_tags on the span snippet would be the faster and easier alternative.
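The strip_tags idea mentioned above could look roughly like the following. This is only a sketch: the $snippet variable is a hypothetical name holding one span from the question as a string, and the whitespace cleanup step is an extra that the answer does not mention.

```php
<?php
// One span from the question, held as a plain string (hypothetical variable).
$snippet = '<span class="url"> word <b class=" ">test2</b> more words</span>';

// strip_tags() drops the markup and keeps the character data between the tags.
$plain = strip_tags($snippet);

// Optional cleanup (an addition, not part of the answer): collapse whitespace runs.
$clean = trim(preg_replace('/\s+/', ' ', $plain));

echo $clean, PHP_EOL;
```

Note that strip_tags works on the raw string, so character entities such as &amp; are left encoded, whereas textContent returns them decoded.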


As of PHP 5.3, you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following fetches the content of all span elements and their child nodes and returns it as a single string.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');

// Custom Callback function
function nodeTextJoin($nodes)
{
    $text = '';
    foreach($nodes as $node) {
        $text .= $node->textContent;
    }
    return $text;
}
与之呼应 2024-09-19 17:19:20

Using XMLReader:

$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
    if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
        echo $xmlr->readString();
    }
}

Output:

word
test

word
test2
more words
忘年祭陌 2024-09-19 17:19:20

SimpleXML doesn't like mixing text nodes with other elements, which is why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two faces of the same coin (libxml), so it's very easy to juggle them. For instance:

foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
    // will not work as expected
    echo $span;

    // will work as expected
    echo textContent($span);
}

function textContent(SimpleXMLElement $node)
{
    return dom_import_simplexml($node)->textContent;
}
看透却不说透 2024-09-19 17:19:20
//span//text()

This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.

Using string() or concat() won't work because these functions expect string arguments. In XPath 1.0, when you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set (in document order). The rest of the nodes are discarded.
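A minimal PHP sketch of both points, assuming the question's two spans are wrapped in a hypothetical <div> root and written on one line for brevity. It shows string(//span) keeping only the first span, and then concatenates each span's descendant text nodes by hand.

```php
<?php
// Markup from the question, wrapped in an assumed <div> root element.
$xml = '<div><span class="url"> word <b class=" ">test</b></span>'
     . '<span class="url"> word <b class=" ">test2</b> more words</span></div>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// string() applied to a node-set uses only the first node in document order.
$first = $xpath->evaluate('string(//span)');

// Join the descendant text nodes of each span manually.
$joined = [];
foreach ($xpath->query('//span') as $span) {
    $text = '';
    foreach ($xpath->query('.//text()', $span) as $textNode) {
        $text .= $textNode->nodeValue;
    }
    $joined[] = $text;
}

var_dump($first);
print_r($joined);
```

Querying .//text() relative to each span keeps the text grouped per span, which a single //span//text() query over the whole document would not do.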

一个人练习一个人 2024-09-19 17:19:20

How can I select the string contents
of the following nodes:

First, I think your question is not clear.

You could select the descendant text nodes, as John Kugelman has answered, with

//span//text()

I recommend using an absolute path (one that does not start with //).

But with this approach you would then have to work out which parent span each text node belongs to. So it would be better to just select the span elements (for example, //span) and then process each one's string value.
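In PHP, one way to read this suggestion is to select the span elements and then evaluate normalize-space(.) with each span as the context node. The sketch below assumes the question's markup wrapped in a hypothetical <div> root; the variable names are illustrative only.

```php
<?php
// The question's markup, wrapped in an assumed <div> root element.
$xml = <<<XML
<div>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</div>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$strings = [];
foreach ($xpath->query('//span[@class="url"]') as $span) {
    // The second argument makes $span the context node for the expression,
    // so "." refers to that span and its string value is whitespace-normalized.
    $strings[] = $xpath->evaluate('normalize-space(.)', $span);
}

print_r($strings);
```

Each array entry is the whole string value of one span, so there is no need to regroup individual text nodes afterwards.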

With XPath 2.0 you could use:

string-join(//span, '.')

Result (whitespace abbreviated):

word test. word test2 more words
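PHP's DOM and SimpleXML extensions are built on libxml2, which implements XPath 1.0, so string-join() is not available there. A rough emulation is sketched below; xpathStringJoin is a hypothetical helper name, and the whitespace normalization inside it is an extra step that string-join() itself does not perform.

```php
<?php
/**
 * Hypothetical helper: joins the string values of all nodes matched by $expr,
 * roughly like XPath 2.0 string-join(), after collapsing whitespace runs.
 */
function xpathStringJoin(DOMXPath $xpath, string $expr, string $separator): string
{
    $values = [];
    foreach ($xpath->query($expr) as $node) {
        $values[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
    }
    return implode($separator, $values);
}

// Markup from the question, wrapped in an assumed <div> root element.
$dom = new DOMDocument();
$dom->loadXML(
    '<div><span class="url"> word <b class=" ">test</b></span>'
    . '<span class="url"> word <b class=" ">test2</b> more words</span></div>'
);

$result = xpathStringJoin(new DOMXPath($dom), '//span', '.');
echo $result, PHP_EOL;
```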

With XSLT 1.0, this input:

<div>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</div>

With this stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <!-- Drop whitespace-only text nodes so position() counts only the span elements -->
    <xsl:strip-space elements="*"/>
    <xsl:template match="span[@class='url']">
        <xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
    </xsl:template>
</xsl:stylesheet>

Output:

word test.word test2 more words
潦草背影 2024-09-19 17:19:20

Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...

XML:

<?xml version="1.0" encoding="UTF-8"?>
<div>
    <span class="url">
        word
        <b class=" ">test</b>
    </span>
    <span class="url">
        word
        <b class=" ">test2</b>
        more words
    </span>
</div>

XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="span">
        <xsl:value-of select="normalize-space(.)"/>
    </xsl:template>
</xsl:stylesheet>

OUTPUT:

word test
word test2 more words