PHP scraping library - phpQuery?

Posted on 2024-08-09 19:49:27

I'm looking for a PHP library that lets me scrape web pages and that takes care of all the cookies and of prefilling forms with their default values, which is what annoys me the most.

I'm tired of having to match every single input element with XPath, and I would love it if something better existed. I've come across phpQuery, but the manual isn't very clear and I can't work out how to make POST requests.

Can someone help me? Thanks.

@Jonathan Fingland:

In the example provided by the manual for browserGet() we have:

require_once('phpQuery/phpQuery.php');

phpQuery::browserGet('http://google.com/', 'success1');

function success1($browser)
{
    $browser->WebBrowser('success2')
    ->find('input[name=q]')->val('search phrase')
    ->parents('form')
    ->submit();
}

function success2($browser)
{
    echo $browser;
}

I suppose all the other fields are scraped and sent back in the GET request. I want to do the same with the phpQuery::browserPost() method, but I don't know how. The form I'm trying to scrape has an input token, and I would love it if phpQuery were smart enough to scrape the token and just let me change the other fields (in this case username and password), submitting everything via POST.

PS: Rest assured, this is not going to be used for spamming.

Comments (3)

心欲静而疯不止 2024-08-16 19:49:27

See http://code.google.com/p/phpquery/wiki/Ajax and in particular:

phpQuery::post($url, $data, $callback, $type)

and

# data Object, String

which defines the data parameter as being either an Object or a String. POST requests should be possible using query string format, e.g.:

$data = "username=Jon&password=123456";
$url = "http://www.mysite.com/login.php";
phpQuery::post($url, $data, $callback, $type);

As phpQuery is a jQuery port, the method signature is the same (the docs link directly to the jQuery site: http://docs.jquery.com/Ajax/jQuery.post).
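If the values come from variables rather than literals, concatenating the query string by hand can break as soon as a value contains `&`, `=` or a space. A minimal sketch of building `$data` with PHP's built-in `http_build_query()` instead; the field names and values here are illustrative only, and whether phpQuery also accepts an array directly is not something I have checked:

```php
<?php
// Illustrative field names and values (not taken from any real form).
$fields = [
    'username' => 'Jon',
    'password' => 'p&ss word=1', // contains characters that must be URL-encoded
];

// http_build_query() URL-encodes each key and value and joins the pairs.
// The separator is passed explicitly so the result does not depend on php.ini.
$data = http_build_query($fields, '', '&');

echo $data, PHP_EOL;
```

The resulting string can then be passed as the `$data` argument in the call above.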

Edit

Two things:

There is also a phpQuery::browserPost function which might meet your needs better.

However, also note that the success2 callback is only called on the submit() or click() methods so you can fill in all of the form fields prior to that.

e.g.

require_once('phpQuery/phpQuery.php');
phpQuery::browserGet('http://www.mysite.com/login.php', 'success1');
function success1($browser) {
  $handle = $browser
    ->WebBrowser('success2');
  $handle 
    ->find('input[name=username]')
      ->val('Jon');
  $handle
    ->find('input[name=password]')
      ->val('123456')
      ->parents('form')
        ->submit();
}
function success2($browser) {
  print $browser;
}

(Note that this has not been tested, but should work)
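If it turns out you need to build the POST body yourself, the "keep every default, change only a few fields" step can be sketched without phpQuery. This is only an illustration using PHP's built-in DOM extension; `formFields()` is a hypothetical helper (not part of phpQuery), the HTML is a made-up login form, and the sketch only looks at `<input>` elements, ignoring `<select>` and `<textarea>`:

```php
<?php
/**
 * Hypothetical helper: collect the default values of the named <input>
 * elements in the first form of $html, then apply $overrides on top.
 */
function formFields(string $html, array $overrides = []): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $fields = [];
    foreach ($xpath->query('(//form)[1]//input[@name]') as $input) {
        $type = strtolower($input->getAttribute('type'));
        // Buttons and file inputs are not treated as plain default values here.
        if (in_array($type, ['submit', 'button', 'image', 'reset', 'file'], true)) {
            continue;
        }
        // Unchecked checkboxes and radio buttons are not submitted by browsers.
        if (in_array($type, ['checkbox', 'radio'], true) && !$input->hasAttribute('checked')) {
            continue;
        }
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    // array_replace() keeps the form's field order and overwrites by name.
    return array_replace($fields, $overrides);
}

// Made-up login form with a hidden token, for illustration only.
$html = '<form method="post" action="/login.php">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="username" value="">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>';

$fields = formFields($html, ['username' => 'Jon', 'password' => '123456']);
$data = http_build_query($fields, '', '&');

echo $data, PHP_EOL;
```

The hidden token is carried over untouched, and only the fields named in the overrides array change. The resulting `$data` string is in the query string format described earlier.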

月朦胧 2024-08-16 19:49:27

I've used SimpleTest's ScriptableBrowser for such stuff in the past. It's part of the SimpleTest testing framework, but you can use it stand-alone.

入画浅相思 2024-08-16 19:49:27

I would use a dedicated library for parsing HTML files and a dedicated library for processing HTTP requests. Using the same library for both seems like a bad idea, IMO.

For processing HTTP requests, check out e.g. Httpful, Unirest, Requests or Guzzle. Guzzle is especially popular these days, but in the end, whichever library works best for you is still a matter of personal taste.

For parsing HTML files I would recommend a library that I wrote myself: DOM-Query. It allows you to (1) load an HTML file and then (2) select or change parts of your HTML pretty much the same way you would if you were using jQuery in a frontend app.
