PHP scraping with XPath - an HTML5 problem?

Posted 2025-01-06 18:29:37

I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.

The page to be scraped looks something like:

<!DOCTYPE html> 
<html lang="en">
    <head></head>
    <body>
        <div><span>Blah</span></div>
        <div><span>Blah</span> Blah</div>
        <div>
            <form method="POST" action="blah">
                <input name="SomeName" id="SomeId" value="GET ME"/>
                <input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
            </form>
        </div>
    </body>
</html>

and I'm attempting to parse it like this:

$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));

NB: dump() just wraps print_r() but adds some stack trace info and formatting.

The output is as follows:

14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value

14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)

I assume this means it was unable to find anything in the document that matches my selector? I've tried a number of variations, just to see if I can get something back:

/input/@value
/input
//input
/div

The only selector that returns anything is /, which returns the entire document.

What am I doing wrong?

EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).

It has been suggested that this isn't possible because the parser chokes on HTML5 (the internal page is HTML5 as well). Does anyone have any experience of this?

Comments (3)

说谎友 2025-01-13 18:29:37

If your selector starts with a single slash (/), it is an absolute path from the document root. Use a double slash (//) to select all matching elements regardless of their location.

print_r won't help here: it can print an empty DOMNodeList object even when the query matched. Everything in your code was fine except for actually reading the value.
List classes in PHP usually have a property called length; check that instead.

$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
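
The length check mentioned above can be sketched as follows. This is a minimal illustration, not the asker's actual setup: it assumes the pared-down markup from the question (held in a hypothetical $html variable) stands in for the fetched page, so no network request is involved.

```php
<?php
// Sample markup copied from the question; a stand-in for the fetched page (assumption).
$html = <<<'MARKUP'
<!DOCTYPE html>
<html lang="en">
    <head></head>
    <body>
        <div>
            <form method="POST" action="blah">
                <input name="SomeName" id="SomeId" value="GET ME"/>
                <input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
            </form>
        </div>
    </body>
</html>
MARKUP;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");

// DOMNodeList exposes the number of matches through its length property.
if ($nodes === false || $nodes->length === 0) {
    $token = null;
    echo "No match\n";
} else {
    // Each match is a DOMAttr node; nodeValue holds the attribute text.
    $token = $nodes->item(0)->nodeValue;
    echo $token, "\n";
}
```

Checking length before calling item(0) avoids dereferencing null when the selector matches nothing.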

迷你仙 2025-01-13 18:29:37

DOMXPath looks fine to me.

As for the XPath, use the descendant-or-self shortcut // to get to the input tag:

//input[@id='SomeId']/@value
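
To make the difference between / and // concrete, here is a small sketch that counts the matches for several expressions. It assumes a one-line document modelled on the question's sample; the markup and the $counts array are illustrative only.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep any parser warnings out of the output
$dom->loadHTML('<!DOCTYPE html><html><body><div><form><input id="SomeId" value="GET ME"></form></div></body></html>');
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

$expressions = [
    '/input',                             // absolute: an input directly under the document root
    '//input',                            // abbreviated descendant-or-self step
    '/descendant-or-self::node()/input',  // what // expands to
    '/html/body/div/form/input',          // full absolute path
];

$counts = [];
foreach ($expressions as $expr) {
    $counts[$expr] = $xpath->query($expr)->length;
    echo $expr, ' => ', $counts[$expr], "\n";
}

// Read the attribute value through the abbreviated form.
$value = $xpath->query("//input[@id='SomeId']/@value")->item(0)->nodeValue;
echo $value, "\n";
```

In this sketch /input matches nothing because the only child of the document root is the html element, while the other three expressions each find the single input.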

妳是的陽光 2025-01-13 18:29:37

I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.

Your XPath is correct, by the way.

You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
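
Before adding a clean-up step, it may be worth checking what PHP's own parser makes of the markup. The sketch below is not TagSoup; it swaps in a simpler diagnostic and assumes the pared-down page from the question (in a hypothetical $html variable) stands in for the real one. DOMDocument::loadHTML() goes through libxml's HTML parser, which is built to tolerate markup that is not well-formed XML, and libxml_get_errors() lists whatever it complained about.

```php
<?php
// Pared-down page from the question, including the unclosed second input (assumption: stand-in data).
$html = '<!DOCTYPE html><html lang="en"><head></head><body><div><form method="POST" action="blah">'
      . '<input name="SomeName" id="SomeId" value="GET ME"/>'
      . '<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">'
      . '</form></div></body></html>';

$previous = libxml_use_internal_errors(true);  // buffer parser messages
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Print every message the parser recorded, if any.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous);         // restore the earlier setting

$xpath = new DOMXPath($dom);
$inputs = $xpath->query('//input');
$token = $xpath->query("//input[@id='csrfToken-login']/@value");

echo 'loaded: ', var_export($loaded, true), "\n";
echo 'inputs found: ', $inputs->length, "\n";
```

If the document loads and both inputs are found, the unclosed tag is unlikely to be what stops the query, and the fetch step is the next thing to inspect.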

I hope this helps,
Zachary
