Extracting node values with XPath

Posted on 2024-12-29 07:58:50

There is a section of amazon.com from which I want to extract the data (node value only, not the link) for each item.

The values I'm looking for are inside <span class="refinementLink"> and <span class="narrowValue">:

<ul data-typeid="n" id="ref_1000">
    <li style="margin-left: -18px">
        <a href="/s/ref=sr_ex_n_0?rh=i%3Aaps%2Ck%3Ahow+to+grow+tomatoes&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358">
            <span class="expand">Any Department</span>
        </a>
    </li>
    <li style="margin-left: 8px">
        <strong>Books</strong>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_0?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A48&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
            <span class="refinementLink">Crafts, Hobbies & Home</span><span class="narrowValue">(19)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
       <a href="/s/ref=sr_nr_n_1?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A10&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
            <span class="refinementLink">Health, Fitness & Dieting</span><span class="narrowValue">(3)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_2?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A6&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
            <span class="refinementLink">Cookbooks, Food & Wine</span><span class="narrowValue">(2)</span>
        </a>
    </li>
</ul>

How could I do this with XPath?

The markup comes from an Amazon Kindle search results page.

Currently I am trying:

$rank = array();

$words = $xpath->query('//ul[@id="ref_1000"]/li/a/span[@class="refinementLink"]');
foreach ($words as $word) {
    $rank[] = trim($word->nodeValue);
}
var_dump($rank);

Comments (3)

盛夏已如深秋 2025-01-05 07:58:50

The following expression should work:

//*[@id='ref_1000']/li/a/span[@class='narrowValue']

For better performance you could provide a direct path to the start of this expression, but the one provided is more flexible (given that you probably need this to work across multiple pages).

Keep in mind, also, that your HTML parser might generate a different result tree than the one produced by Firebug (where I tested). Here's an even more flexible solution:

//*[@id='ref_1000']//span[@class='narrowValue']

Flexibility comes with potential performance (and accuracy) costs, but it's often the only choice when dealing with tag soup.
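
A minimal sketch of how the two expressions above could be run with PHP's DOMDocument and DOMXPath. The `$html` sample is a trimmed copy of the markup from the question, and the helper name `extractCounts` is made up for illustration; the real page may parse into a different tree.

```php
<?php
// Trimmed sample of the question's markup (assumption: the real page has the same shape).
$html = <<<'HTML'
<ul data-typeid="n" id="ref_1000">
    <li><strong>Books</strong></li>
    <li><a href="#"><span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span></a></li>
    <li><a href="#"><span class="refinementLink">Health, Fitness &amp; Dieting</span><span class="narrowValue">(3)</span></a></li>
</ul>
HTML;

// Hypothetical helper: returns the trimmed text of every node matched by $expression.
function extractCounts(string $html, string $expression): array
{
    libxml_use_internal_errors(true); // keep tag-soup warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $values = [];
    foreach ($xpath->query($expression) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

// Strict path: every step from the list down to the span is spelled out.
$strict = extractCounts($html, "//*[@id='ref_1000']/li/a/span[@class='narrowValue']");

// Flexible path: any matching span below the list, however deeply nested.
$flexible = extractCounts($html, "//*[@id='ref_1000']//span[@class='narrowValue']");

var_dump($strict, $flexible);
```

On this sample both expressions select the same nodes; they would diverge only if the parser inserted or dropped an element between the list and the spans.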

傲世九天 2025-01-05 07:58:50

If you need to grab the category names:

// Suppress invalid markup warnings
libxml_use_internal_errors(true);

// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html); // $html - string fetched by CURL 
$xml = simplexml_import_dom($doc);

// Find the category nodes
$categories = $xml->xpath("//span[@class='refinementLink']");

EDIT: using DOMDocument

$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Select the parent node
$categories = $xpath->query("//span[@class='refinementLink']/..");

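foreach ($categories as $category) {
    echo '<pre>';
    // For this markup item(0) is expected to be a whitespace text node,
    // so the name and count spans sit at indexes 1 and 2.
    echo $category->childNodes->item(1)->firstChild->nodeValue;
    echo ' ';
    echo $category->childNodes->item(2)->firstChild->nodeValue;
    echo '</pre>';
    // e.g. Crafts, Hobbies & Home (19)
}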
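
Indexing into childNodes depends on exactly where whitespace text nodes land in the parsed tree. As a hedged alternative, the sketch below asks for each span by class, relative to its parent link, and pairs every category name with its count. The `$html` sample is a trimmed copy of the question's markup and the variable names are illustrative.

```php
<?php
// Trimmed sample of the question's markup (assumption: the real page has the same shape).
$html = <<<'HTML'
<ul data-typeid="n" id="ref_1000">
    <li><strong>Books</strong></li>
    <li><a href="#"><span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span></a></li>
    <li><a href="#"><span class="refinementLink">Health, Fitness &amp; Dieting</span><span class="narrowValue">(3)</span></a></li>
</ul>
HTML;

libxml_use_internal_errors(true); // suppress invalid-markup warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$pairs = [];

// Visit each link that contains a category name, then query relative to that link.
foreach ($xpath->query("//a[span[@class='refinementLink']]") as $link) {
    $name  = $xpath->evaluate("string(span[@class='refinementLink'])", $link);
    $count = $xpath->evaluate("string(span[@class='narrowValue'])", $link);

    // Strip the surrounding parentheses and whitespace from "(19)" before casting.
    $pairs[trim($name)] = (int) trim($count, "() \n\t");
}

print_r($pairs);
```

Passing the link as the context node to `evaluate()` keeps each name tied to the count from the same list item, which a pair of independent document-wide queries would not guarantee.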

为人所爱 2025-01-05 07:58:50

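I'd highly recommend you check out the phpQuery library. It's essentially jQuery's selector engine for PHP, so once the page has been loaded into phpQuery you could get at the text you want with something like:

foreach (pq('span.refinementLink') as $p) {
  // Iterating yields plain DOM elements, so wrap each one in pq() before calling text()
  print pq($p)->text() . "\n";
}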

That should output something like:

Crafts, Hobbies & Home
Health, Fitness & Dieting
Cookbooks, Food & Wine

It's by far the easiest screen-scraping and DOM-parsing tool I know of for PHP.
