DOM and XPath scraping: text of two li elements returned when only one is wanted

Posted on 2024-11-03 07:48:05

Here is my code so far.

$dom_currys = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom_currys->loadHTMLFile('http://www.currys.co.uk/gbuk/apple-new-ipod-touch-8gb-4th-generation-07677427-pdt.html');
libxml_clear_errors();

$xpath_currys = new DOMXpath($dom_currys);
$nodes_currys = $xpath_currys->query(
    '/html/body/div/div/div[2]/div/div/div[2]/div/ul[2]/li/span'
);
$currys_stock_data = $nodes_currys->item(0)->nodeValue; // "Available for home delivery"

echo $currys_stock_data;

When echoed, it prints:

Â Available for home delivery
Â Available to reserve & collect

I only need the "Available for home delivery" part. Each line sits in its own "li" element, yet the query still brings back both. If I wanted the second one, the XPath would be

/html/body/div/div/div[2]/div/div/div[2]/div/ul[2]/li[2]/span

I suspect it's to do with selecting the correct item, but I'm not sure whether I'm doing that correctly.

I also need to check the result with an if statement. This is what I have so far:

if (strpos($currys_stock_data, 'Available for home') !== false) {
    $currys_stockyesno = "Yes";
} else {
    $currys_stockyesno = "No";
}
echo $currys_stockyesno;

I thought it would be best to check whether the text contains "Available for home delivery" rather than test for an exact match, because the website sometimes says something along the lines of "available for home delivery in 2 days". As long as the text contains that string, the check should come back as true/yes. But it's saying no...

葬﹪忆之殇 2024-11-10 07:48:05

I looked at the site you are scraping and found that the li is what actually contains the text. The span only carries a class for the icon. Since the check mark icon changes, we need to check for that too. However, it doesn't seem like you actually need the text; you only need to check whether the item allows home delivery.

$xpath = "//li[contains(., 'Available for home delivery')]/span[@class='icon icon-check']";

Then run the query with that expression and just check the length of the result:

if( $nodes_currys->length === 1 ) // true if available for home.

I should also note that this method will not work on their search/browse pages, since they use images there. Very confusing, and one reason I hate scraping :P
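For illustration, here is a minimal end-to-end sketch of the length check described above. The HTML sample is an assumption reconstructed from the description (an icon span followed by a non-breaking space and the text inside each li); the real page may be structured differently, so treat this as a starting point rather than a tested scraper.

```php
<?php
// Hypothetical markup, assumed from the description above; the live page may differ.
$html = <<<HTML
<html><body>
<ul>
  <li><span class="icon icon-check"></span>&nbsp;Available for home delivery</li>
  <li><span class="icon icon-check"></span>&nbsp;Available to reserve &amp; collect</li>
</ul>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// class is an attribute, so it has to be written as @class in the predicate.
// contains(., '...') tests the whole string value of the li, text included.
$nodes = $xpath->query(
    "//li[contains(., 'Available for home delivery')]/span[@class='icon icon-check']"
);

// Exactly one matching span means the home delivery line carries the check icon.
$currys_stockyesno = ($nodes->length === 1) ? "Yes" : "No";
echo $currys_stockyesno, PHP_EOL; // prints "Yes" for the sample markup
```

Because the predicate matches on the li text and the span class together, a changed icon class or missing line yields a length of 0 and therefore "No".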

梦回梦里 2024-11-10 07:48:05

I have examined the HTML source and confirmed that

/html/body/div/div/div[2]/div/div/div[2]/div/ul[2]/li

selects two elements.

If you want to select only the first of the two text nodes, use:

/html/body
      /div/div/div[2]
                /div/div/div[2]
                          /div/ul[2]/li[1]
                                      /span/following-sibling::text()
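As a rough sketch of how that text node could be used from PHP, the example below runs a following-sibling::text() query against a small inline document and then applies the strpos check from the question. The sample markup and the shortened //ul/li[1] path are assumptions made for the sake of a self-contained example; against the real page the full path above would be used instead.

```php
<?php
// Hypothetical markup standing in for the real page; only the li/span/text layout matters.
$html = <<<HTML
<html><body>
<ul>
  <li><span class="icon icon-check"></span>&nbsp;Available for home delivery</li>
  <li><span class="icon icon-check"></span>&nbsp;Available to reserve &amp; collect</li>
</ul>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// li[1] restricts the match to the first list item; the text node after the
// span holds the label. The path is shortened for this sample document.
$nodes = $xpath->query('//ul/li[1]/span/following-sibling::text()');

$currys_stock_data = '';
if ($nodes->length > 0) {
    // Strip ordinary whitespace and the non-breaking space (U+00A0) from both ends.
    $currys_stock_data = preg_replace(
        '/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u',
        '',
        $nodes->item(0)->nodeValue
    );
}

// Substring check, so longer variants of the label still count as a match.
$currys_stockyesno = (strpos($currys_stock_data, 'Available for home') !== false) ? "Yes" : "No";

echo $currys_stock_data, PHP_EOL;
echo $currys_stockyesno, PHP_EOL;
```

Guarding on the length before calling item(0) avoids an error when the page layout changes and the query matches nothing; in that case the result falls through to "No".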
