PHP scraping with XPath - HTML5 problem?

Posted on 2025-01-06 18:29:37

I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.

The page to be scraped looks something like:

<!DOCTYPE html> 
<html lang="en">
    <head></head>
    <body>
        <div><span>Blah</span></div>
        <div><span>Blah</span> Blah</div>
        <div>
            <form method="POST" action="blah">
                <input name="SomeName" id="SomeId" value="GET ME"/>
                <input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
            </form>
        </div>
    </body>
</html>

and I'm attempting to parse it like this:

$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));

NB: dump() just wraps print_r() but adds some stack trace info and formatting.

The output is as follows:

14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value

14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)

I'm assuming this means it was unable to find anything in the document that matches my selector? I've tried a number of variations, just to see if I can get something back:

/input/@value
/input
//input
/div

The only selector which I've been able to get anything from is / which returns the entire document.

What am I doing wrong?

EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).

There's been a suggestion that this isn't possible because the parser chokes on HTML5 (which the internal page also uses). Does anyone have any experience of this?


说谎友 2025-01-13 18:29:37

If your selector starts with a single slash (/), it is an absolute path from the document root, so /input or /div only matches when that element is the root element itself. Use a double slash (//) instead, which selects all matching elements regardless of their location.
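To make the difference concrete, here is a minimal sketch that runs a few of the selectors from the question against a trimmed copy of the sample page. The inline markup and the $counts array are assumptions made for illustration; nothing is fetched from LinkedIn.

```php
<?php
// Hypothetical sample markup, trimmed from the page shown in the question.
$html = <<<'HTML'
<!DOCTYPE html>
<html lang="en">
    <head></head>
    <body>
        <div>
            <form method="POST" action="blah">
                <input name="SomeName" id="SomeId" value="GET ME"/>
                <input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
            </form>
        </div>
    </body>
</html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);

// Count the matches for absolute (/) and descendant (//) selectors.
$counts = [];
foreach (['/input', '/div', '//input', '/html/body/div/form/input'] as $selector) {
    $counts[$selector] = $xpath->query($selector)->length;
    echo $selector, ' => ', $counts[$selector], PHP_EOL;
}
```

The root element here is html, so /input and /div match nothing, while //input and the full absolute path both find the two inputs.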

print_r won't work for this: as your output shows, it can print an empty DOMNodeList object even when the query matched nodes. Everything in your code was fine except for actually getting the value.
DOMNodeList has a property called length; check that instead.

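// Fetch the page; file_get_contents() returns false on failure.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
// Collect libxml's parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom->loadHTML($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
// Check length rather than print_r(); item(0) is null when nothing matched.
if ($b->length > 0) {
    echo $b->item(0)->value;
}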
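The same idea can be wrapped in a small function that returns null when nothing matches. This is only a sketch under stated assumptions: inputValueById is a hypothetical helper name, and the inline markup (including the HTML5 section element) stands in for the real page so the example runs without network access.

```php
<?php
// Hypothetical helper (not from the original answer): returns the value
// attribute of the input with the given id, or null when nothing matches.
function inputValueById(string $html, string $id): ?string
{
    $dom = new DOMDocument;
    // libxml may report HTML5 tags such as <section> as errors,
    // but it still builds the tree; collect the errors quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumes $id contains no single quote, since it is spliced into the expression.
    $nodes = $xpath->query("//input[@id='" . $id . "']/@value");

    // length says whether the query matched; print_r() may not.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->value;
}

// Assumed stand-in for the downloaded page.
$html = '<!DOCTYPE html><html lang="en"><body><section><form>'
      . '<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">'
      . '</form></section></body></html>';

$token = inputValueById($html, 'csrfToken-login');
$missing = inputValueById($html, 'no-such-id');

echo $token, PHP_EOL;
var_dump($missing);
```

Returning null for "no match" keeps the caller from dereferencing item(0) on an empty list, which is the failure the length check is meant to prevent.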
