Is there a PHP equivalent of Perl's WWW::Mechanize?

Posted on 2025-01-16 22:37:43

I'm looking for a library that has functionality similar to Perl's WWW::Mechanize, but for PHP. Basically, it should allow me to submit HTTP GET and POST requests with a simple syntax, and then parse the resulting page and return in a simple format all forms and their fields, along with all links on the page.

I know about CURL, but it's a little too barebones, and the syntax is pretty ugly (tons of curl_foo($curl_handle, ...) statements).

Clarification:

I want something more high-level than the answers so far. For example, in Perl, you could do something like:

# navigate to the main page
$mech->get( 'http://www.somesite.com/' ); 

# follow a link that contains the text 'download this'
$mech->follow_link( text_regex => qr/download this/i );

# submit a POST form, to log into the site
$mech->submit_form(
    with_fields      => {
        username    => 'mungo',
        password    => 'lost-and-alone',
    }
);

# save the results as a file
$mech->save_content('somefile.zip');

To do the same thing using HTTP_Client, wget, or CURL would be a lot of work: I'd have to manually parse the pages to find the links, find the form URL, extract all the hidden fields, and so on. The reason I'm asking for a PHP solution is that I have no experience with Perl. I could probably build what I need with a lot of work, but it would be much quicker if I could do the above in PHP.

Comments (9)

风蛊 2025-01-23 22:37:43

SimpleTest's ScriptableBrowser can be used independently of the testing framework. I've used it for numerous automation jobs.

揪着可爱 2025-01-23 22:37:43

I feel compelled to answer this, even though it's an old post. I've been working with PHP curl a lot, and it is nowhere near as good as something like WWW::Mechanize, which I am switching to (I think I am going to go with the Ruby implementation). Curl is outdated in that it requires too much "grunt work" to automate anything. The SimpleTest scriptable browser looked promising to me, but in testing it wouldn't work on most of the web forms I tried it on. Honestly, I think PHP is lacking in this category of scraping and web automation, so it's best to look at a different language. I just wanted to post this since I have spent countless hours on this topic, and maybe it will save someone else some time in the future.

遮云壑 2025-01-23 22:37:43

It's 2016 now and there's Mink. It even supports different engines, from a headless pure-PHP "browser" (without JavaScript), through Selenium (which needs a browser like Firefox or Chrome), to a headless "browser.js" in NPM, which does support JavaScript.

似狗非友 2025-01-23 22:37:43

Try looking in the PEAR library. If all else fails, create an object wrapper for curl.

You can do something simple like this:

class curl {
    private $resource;

    public function __construct($url) {
        $this->resource = curl_init($url);
    }

    public function __call($function, array $params) {
        array_unshift($params, $this->resource);
        return call_user_func_array("curl_$function", $params);
    }
}
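
A usage sketch for a wrapper like the one above. The class is repeated here under the hypothetical name CurlWrapper so the snippet runs on its own, and the URL is a placeholder. Each method call is forwarded to the curl_* function of the same name, with the handle passed as the first argument.

```php
<?php
// Hypothetical copy of the wrapper above, renamed so the snippet is self-contained.
class CurlWrapper {
    private $resource;

    public function __construct($url) {
        $this->resource = curl_init($url);
    }

    // $w->setopt(...) becomes curl_setopt($handle, ...),
    // $w->exec() becomes curl_exec($handle), and so on.
    public function __call($function, array $params) {
        array_unshift($params, $this->resource);
        return call_user_func_array("curl_$function", $params);
    }
}

$page = new CurlWrapper('http://www.example.com/');        // placeholder URL
$configured = $page->setopt(CURLOPT_RETURNTRANSFER, true); // forwards to curl_setopt()

// Uncomment to perform the request (needs network access):
// $html = $page->exec();
// if ($html === false) { echo $page->error(); }
```

This only tidies the syntax; links and forms in the response would still have to be parsed separately.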

甜警司 2025-01-23 22:37:43

Try one of the following:

(Yes, it's Zend Framework code, but using it won't make your class slower, since it loads only the required libraries.)

过期以后 2025-01-23 22:37:43

Curl is the way to go for simple requests. It runs cross platform, has a PHP extension and is widely adopted and tested.

I created a nice class that can GET a URL or POST an array of data to one by just calling CURLHandler::Get($url) or CURLHandler::Post($url, $data). There's an optional HTTP user authentication option too :)

/**
 * CURLHandler handles simple HTTP GETs and POSTs via Curl 
 * 
 * @package Pork
 * @author SchizoDuckie
 * @copyright SchizoDuckie 2008
 * @version 1.0
 * @access public
 */
class CURLHandler
{

    /**
     * CURLHandler::Get()
     * 
     * Executes a standard GET request via Curl.
     * Static function, so that you can use: CurlHandler::Get('http://www.google.com');
     * 
     * @param string $url url to get
     * @return string HTML output
     */
    public static function Get($url)
    {
       return self::doRequest('GET', $url);
    }

    /**
     * CURLHandler::Post()
     * 
     * Executes a standard POST request via Curl.
     * Static function, so you can use CurlHandler::Post('http://www.google.com', array('q'=>'StackOverFlow'));
     * Values are encoded with http_build_query(), so they are sent as plain form fields.
     * @param string $url url to post data to
     * @param Array $vars Array with key=>value pairs to post.
     * @return string HTML output
     */
    public static function Post($url, $vars, $auth = false) 
    {
       return self::doRequest('POST', $url, $vars, $auth);
    }

    /**
     * CURLHandler::doRequest()
     * This is what actually does the request
     * <pre>
     * - Create Curl handle with curl_init
     * - Set options like CURLOPT_URL, CURLOPT_RETURNTRANSFER and CURLOPT_HEADER
     * - Set eventual optional options (like CURLOPT_POST and CURLOPT_POSTFIELDS)
     * - Call curl_exec on the interface
     * - Close the connection
     * - Return the result or throw an exception.
     * </pre>
     * @param mixed $method Request Method (Get/ Post)
     * @param mixed $url URI to get or post to
     * @param mixed $vars Array of variables (only mandatory in POST requests)
     * @return string HTML output
     */
    public static function doRequest($method, $url, $vars=array(), $auth = false)
    {
        $curlInterface = curl_init();

        curl_setopt_array ($curlInterface, array( 
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_FOLLOWLOCATION =>1,
            CURLOPT_HEADER => 0));
        if (strtoupper($method) == 'POST')
        {
            curl_setopt_array($curlInterface, array(
                CURLOPT_POST => 1,
                CURLOPT_POSTFIELDS => http_build_query($vars))
            );  
        }
        if($auth !== false)
        {
              curl_setopt($curlInterface, CURLOPT_USERPWD, $auth['username'] . ":" . $auth['password']);
        }
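        $result = curl_exec($curlInterface);

        // curl_exec() returns false on failure; read the error before closing the handle.
        if ($result === false)
        {
            $error = curl_errno($curlInterface) . " - " . curl_error($curlInterface);
            curl_close($curlInterface);
            throw new Exception('Curl Request Error: ' . $error);
        }

        curl_close($curlInterface);
        return $result;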
    }

}


[Edit] I read the clarification only now... You probably want to go with one of the tools mentioned above that automate this. You could also use a client-side Firefox extension like ChickenFoot for more flexibility. I'll leave the example class above here for future searches.

番薯 2025-01-23 22:37:43

If you're using CakePHP in your project, or if you're inclined to extract the relevant library, you can use their HTTP client class, HttpSocket. It has the simple page-fetching syntax you describe, e.g.,

# This is the sugar for importing the library within CakePHP       
App::import('Core', 'HttpSocket');
$HttpSocket = new HttpSocket();

$result = $HttpSocket->post($login_url,
array(
  "username" => "username",
  "password" => "password"
)
);

...although it doesn't have a way to parse the response page. For that I'm going to use simplehtmldom (http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/), which describes itself as having a jQuery-like syntax.
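
As a rough sketch of the parsing step, here is one way to list a page's links and form fields without a third-party library. It swaps in PHP's bundled DOMDocument and DOMXPath in place of simplehtmldom. The function name extractLinksAndForms() and the sample HTML are made up for illustration, and only input elements are handled (select and textarea are left out for brevity).

```php
<?php
// Hypothetical helper, not part of any library: returns the links and forms of an HTML page.
function extractLinksAndForms(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'text' => trim($a->textContent),
            'href' => $a->getAttribute('href'),
        ];
    }

    $forms = [];
    foreach ($xpath->query('//form') as $form) {
        $fields = [];
        // Hidden inputs are included, since they must be sent back on submit.
        foreach ($xpath->query('.//input[@name]', $form) as $input) {
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
        $forms[] = [
            'action' => $form->getAttribute('action'),
            'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
            'fields' => $fields,
        ];
    }

    return ['links' => $links, 'forms' => $forms];
}

// Made-up sample page.
$html = '<html><body><a href="/files/a.zip">Download this</a>'
      . '<form action="/login" method="post">'
      . '<input type="hidden" name="token" value="abc123">'
      . '<input type="text" name="username">'
      . '<input type="password" name="password">'
      . '</form></body></html>';

$result = extractLinksAndForms($html);
```

The returned 'fields' array can be merged with your own values and passed to whatever performs the POST, which covers the hidden-field bookkeeping mentioned in the question.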

I tend to agree that the bottom line is that PHP doesn't have the awesome scraping/automation libraries that Perl/Ruby have.

小…红帽 2025-01-23 22:37:43

If you're on a *nix system you could use shell_exec() with wget, which has a lot of nice options.
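
A minimal sketch of that approach, assuming wget is installed and on the PATH. The helper name buildWgetCommand() and the URL are placeholders; the helper only builds the command line so that the URL is always shell-escaped.

```php
<?php
// Hypothetical helper: build a wget command line that prints the page body to stdout.
function buildWgetCommand(string $url): string
{
    // Accept only http(s) URLs so the argument can never be mistaken for a wget option.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        throw new InvalidArgumentException("Unsupported URL: $url");
    }
    // -q: no progress output; -O -: write the document to stdout instead of a file.
    return 'wget -q -O - ' . escapeshellarg($url);
}

$command = buildWgetCommand('http://www.example.com/'); // placeholder URL

// Uncomment to run it (needs wget and network access):
// $html = shell_exec($command);
```

As with plain curl, this only fetches the page; following links or filling in forms still requires parsing the returned HTML.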
