Filtering extracted data using XPath

Posted 2025-01-04 06:59:54 · 2773 characters · 0 views · 0 comments

I am using the following markup as the input to a DOMDocument:

<li id="SalesRank">
<b>Amazon Best Sellers Rank:</b>
#20,267 Paid in Kindle Store (
<a href="http://www.amazon.com/gp/bestsellers/digital-text/ref=pd_dp_ts_kstore_1/190-9295683-0277616">See Top 100 Paid in Kindle Store</a>
) 
<ul class="zg_hrsr">
<li class="zg_hrsr_item">
<span class="zg_hrsr_rank">#15</span>
<span class="zg_hrsr_ladder">
in 
<a href="http://www.amazon.com/gp/bestsellers/digital-text/ref=pd_zg_hrsr_kstore_1_1">Kindle Store</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/154606011">Kindle eBooks</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/157325011">Nonfiction</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/292975011">Lifestyle & Home</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/156699011">Home & Garden</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/156828011">Gardening & Horticulture</a>
 > 
<b>
<a href="http://rads.stackoverflow.com/amzn/click/156847011">Greenhouses</a>
</b>
</span>
</li>
<li class="zg_hrsr_item">
<span class="zg_hrsr_rank">#26</span>
<span class="zg_hrsr_ladder">
in 
<a href="http://www.amazon.com/gp/bestsellers/digital-text/ref=pd_zg_hrsr_kstore_2_1">Kindle Store</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/154606011">Kindle eBooks</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/157325011">Nonfiction</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/292975011">Lifestyle & Home</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/156699011">Home & Garden</a>
 > 
<a href="http://rads.stackoverflow.com/amzn/click/156828011">Gardening & Horticulture</a>
 > 
<b>
<a href="http://rads.stackoverflow.com/amzn/click/156849011">House Plants</a>
</b>
</span>
</li>
</ul></li>

I am using the following XPath query and reading the textContent of the result:

$xpath_cat->query('//li[@id="SalesRank"]');

The output includes the text of everything nested inside the li with id="SalesRank", including the nested li items, whereas I only want the "#20,267 Paid in Kindle Store" part.

So the required output is:

#20,267 Paid in Kindle Store

How can I modify my XPath to get the required output?

Update

I tried the solution provided below and used this XPath:

$xpath_cat->query('//li[@id="SalesRank"]/text()');

But now the output is:

( [0] => [1] => #20,267 Paid in Kindle Store ( [2] => )

How can I fix this?

Comments (2)

挽清梦 2025-01-11 06:59:54

Does //li[@id='SalesRank']/text() work for you?

Update 1

If the text you want will always be in that location, then

substring-before(normalize-space(//li[@id='SalesRank']/text()[2]), ' (')

will return

#20,267 Paid in Kindle Store

This uses normalize-space to strip out extraneous whitespace, and substring-before to select all text before the first occurrence of " (".
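Because this expression yields a string rather than a node list, it has to be run with DOMXPath::evaluate() instead of query(). Below is a minimal sketch of that, using an abbreviated copy of the question's markup (nested list omitted). As an assumption on my part, it selects the first non-blank text node with text()[normalize-space()][1] rather than text()[2], so the result does not depend on whether the parser keeps the leading whitespace-only node.

```php
<?php
// Abbreviated copy of the markup from the question (nested <ul> omitted).
$html = <<<HTML
<li id="SalesRank">
<b>Amazon Best Sellers Rank:</b>
#20,267 Paid in Kindle Store (
<a href="http://www.amazon.com/gp/bestsellers/digital-text/ref=pd_dp_ts_kstore_1/190-9295683-0277616">See Top 100 Paid in Kindle Store</a>
)
</li>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// evaluate() returns a PHP string for string-valued expressions;
// query() can only return node lists.
$rank = $xpath->evaluate(
    "substring-before(normalize-space(//li[@id='SalesRank']/text()[normalize-space()][1]), ' (')"
);

echo $rank, "\n";   // #20,267 Paid in Kindle Store
```

If the text does not contain " (", substring-before returns an empty string, so a caller would probably want to check for that case.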

This problem will be much easier if you can get the target text in its own node, like:

<b>Amazon Best Sellers Rank:</b>
<span>#20,267 Paid in Kindle Store</span> (
<a href="http://www.amazon.com/gp/bestsellers/digital-text/ref=pd_dp_ts_kstore_1/190-9295683-0277616">See Top 100 Paid in Kindle Store</a>
) 

<span/> has no effect on rendering and allows you to specifically select the text you want.

If the second solution doesn't work in all cases, and you cannot get the target text into its own node, you will have to rely on some post-processing in the host language (PHP, I presume).

Hope this helps.

相权↑美人 2025-01-11 06:59:54

    Let's assume that $element is a DOMElement containing <li id="SalesRank">...

    If you do something like this:

    foreach( $element->childNodes as $node){
        echo get_class( $node) . "\n";
        // Print content too for debug:
        // echo $node->nodeValue . "\n";
    }
    

    You should get results like:

    DOMText // \n
    DOMElement // <b>Amazon Best Sellers Rank:</b>
    DOMText // #20,267 Paid in Kindle Store (\n
    DOMElement // <a ...
    ...
    

    So $element->childNodes->item( 2)->nodeValue should contain your string (do your homework, check each iteration, check documentation for each element).

    Then you can simply take the string up to the first "(":

    $text = $element->childNodes->item( 2)->nodeValue;
    $pos = strpos( $text, '(');
    // If there is no "(", keep the whole text node
    return trim( $pos === false ? $text : substr( $text, 0, $pos));
    

    Or you could iterate through all child nodes and check each one on the fly:

    foreach( $element->childNodes as $node){
        // Example, rather use regexp with preg_match
        if( (get_class( $node) == 'DOMText')
            && (strncmp( $node->nodeValue, "\n#", 2) == 0)){
            // Tadaaa
            break;
        }
    }
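The child-node loop above can be fleshed out into a small function. This is only a sketch: extractRankText is a hypothetical helper name, the markup is an abbreviated copy of the question's with a placeholder href, and the rule "first text node starting with #" is an assumption about how the page is laid out.

```php
<?php
// Hypothetical helper (not part of the answer): returns the rank text found
// directly inside the given element, or null if no text node starts with "#".
function extractRankText(DOMElement $element): ?string
{
    foreach ($element->childNodes as $node) {
        if (!$node instanceof DOMText) {
            continue;                    // skip <b>, <a>, <ul>, ...
        }
        $text = trim($node->nodeValue);
        if ($text !== '' && $text[0] === '#') {
            // Strip the trailing " (" that precedes the link, if present.
            return rtrim($text, ' (');
        }
    }
    return null;
}

// Abbreviated markup; the href is a placeholder.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<li id="SalesRank"><b>Amazon Best Sellers Rank:</b>
#20,267 Paid in Kindle Store (
<a href="#">See Top 100 Paid in Kindle Store</a>
)</li>');
libxml_clear_errors();

$li = (new DOMXPath($doc))->query('//li[@id="SalesRank"]')->item(0);
$rankText = extractRankText($li);
echo $rankText, "\n";   // #20,267 Paid in Kindle Store
```

Checking the node type with instanceof and trimming before the comparison avoids depending on the exact leading whitespace of the text node.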
    

    Or, if you want a solution that will kill a few kittens:

    preg_match( '~(#([\d,]+) ([^<>(]+))~', $element->nodeValue, $match);
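To show what the regular expression captures, here is a short sketch. The $nodeValue string is written out by hand and is an assumption about roughly what $element->nodeValue holds for the markup in the question; the variable names are illustrative.

```php
<?php
// Hand-written stand-in for $element->nodeValue (assumed, abbreviated).
$nodeValue = "\nAmazon Best Sellers Rank:\n#20,267 Paid in Kindle Store (\nSee Top 100 Paid in Kindle Store\n) \n";

if (preg_match('~(#([\d,]+) ([^<>(]+))~', $nodeValue, $match)) {
    // The last group runs up to the "(" and so carries a trailing space.
    $full   = trim($match[1]);   // "#20,267 Paid in Kindle Store"
    $number = $match[2];         // "20,267"
    $label  = trim($match[3]);   // "Paid in Kindle Store"
    echo $full, "\n";
} else {
    $full = null;
    echo "No rank found\n";
}
```

The pattern relies on the rank being the first "#" in the text, so it may need tightening if other "#" characters can appear earlier.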
    