How do I scrape data from websites?

Posted 2024-08-06 12:17:55

So, I often check my accounts for different numbers. For example, my affiliate accounts: I check whether the cash has increased.

I want to write a script that can log in to all these websites, grab the money values for me, and display them on one page. How can I program this?


Comments (6)

叹沉浮 2024-08-13 12:17:55

You should take a look at curl.
With it you should be able to put together a script that retrieves a web page quite easily.

Also take a look at SimpleXML and DOM; they will help you extract information from (X)HTML files.

Zend_Http could also be a good alternative to curl.

Cheers
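For instance, here is a minimal sketch of that approach; the URL is a placeholder, and the page is assumed to be well-formed enough for DOM to parse:

<?php
    // fetch a page with curl; https://example.com/dashboard is a placeholder
    $ch = curl_init("https://example.com/dashboard");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);

    if($html === false){
        die("curl error: ".curl_error($ch));
    }
    curl_close($ch);

    // parse the (X)HTML with DOM; @ silences warnings from sloppy markup
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    // print the page title as a quick smoke test
    print $doc->getElementsByTagName("title")->item(0)->nodeValue."\n";
?>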

三岁铭 2024-08-13 12:17:55

Well, sort of a vague question... I'd suggest the following steps:

  • send the login credentials via POST
  • grab and parse the response
  • do this for all relevant accounts / sites you want to check
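A rough sketch of that flow with curl follows; the URL, form field names, and credentials are all placeholders you would swap for the target site's real login form:

<?php
    // step 1: send the login credentials via POST, keeping the session cookie
    $cookieJar = tempnam(sys_get_temp_dir(), "cookies");

    $ch = curl_init("https://example.com/login"); // placeholder login URL
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        "username" => "joe",    // placeholder credentials
        "password" => "bloe",
    )));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);  // write cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar); // and send them back
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);

    // step 2: grab the page that shows the balance, reusing the session
    curl_setopt($ch, CURLOPT_URL, "https://example.com/account");
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // step 3: parse the response (a regex for brevity; DOM is more robust)
    if(preg_match('/Balance:\s*\$([\d.,]+)/', $html, $m)){
        print "Balance: ".$m[1]."\n";
    }
?>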

If you face specific problems, feel free to comment on this answer.

EDIT: I agree with RageZ on his technical approach. curl would be the 'weapon of choice' for me too... ^^

hth

K

狼性发作 2024-08-13 12:17:55

First of all, check whether the services you want to log in to have APIs.
That would be much easier, since an API is a format made specifically for getting at the data and using it in another application.

If there is an API, you can look at its documentation to see how to retrieve and use the data.

If there isn't one, you need to scrape the HTML pages.
You can start by taking a look at curl: http://php.net/curl
The idea is to simulate your own visit to the website by sending the login POST request and grabbing the data it returns.

After retrieving a page's data, you can parse it with tools like DOM:
http://php.net/dom
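For example, once you have the HTML, an XPath query can pull a specific value out of it; the markup and class name below are invented for illustration:

<?php
    // suppose the fetched page contains: <span class="balance">$123.45</span>
    // (an invented structure; inspect the real page to find the right selector)
    $html = '<html><body><span class="balance">$123.45</span></body></html>';

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences warnings from sloppy real-world HTML

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//span[@class="balance"]');

    if($nodes->length > 0){
        print "Balance: ".trim($nodes->item(0)->nodeValue)."\n";
    }
?>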

烟燃烟灭 2024-08-13 12:17:55

Use TestPlan; it was designed as a web automation system and makes tasks like this very simple.

七婞 2024-08-13 12:17:55

I would really take a look at Snoopy if I were you; it is more user-friendly than curl to use in your PHP scripts. Here is some sample code:

<?php
    /*
    You need the snoopy.class.php from 
    http://snoopy.sourceforge.net/
    */

    include("snoopy.class.php");

    $snoopy = new Snoopy;

    // need a proxy?:
    //$snoopy->proxy_host = "my.proxy.host";
    //$snoopy->proxy_port = "8080";

    // set browser and referer:
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
    $snoopy->referer = "http://www.jonasjohn.de/";

    // set some cookies:
    $snoopy->cookies["SessionID"] = '238472834723489';
    $snoopy->cookies["favoriteColor"] = "blue";

    // set a raw header:
    $snoopy->rawheaders["Pragma"] = "no-cache";

    // set some internal variables:
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;

    // set username and password (optional)
    //$snoopy->user = "joe";
    //$snoopy->pass = "bloe";

    // fetch the text of the website www.google.com:
    if($snoopy->fetchtext("http://www.google.com")){ 
        // other methods: fetch, fetchform, fetchlinks, submittext and submitlinks

        // response code:
        print "response code: ".$snoopy->response_code."<br/>\n";

        // print the headers:

        print "<b>Headers:</b><br/>";
        // each() was removed in PHP 8; iterate with foreach instead
        foreach($snoopy->headers as $key => $val){
            print $key.": ".$val."<br/>\n";
        }

        print "<br/>\n";

        // print the texts of the website:
        print "<pre>".htmlspecialchars($snoopy->results)."</pre>\n";

    }
    else {
        print "Snoopy: error while fetching document: ".$snoopy->error."\n";
    }
?>

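Since the goal is to log in first, here is a hedged sketch using Snoopy's submit() method; the URL, form field names, and credentials are placeholders for the real login form:

<?php
    include("snoopy.class.php");

    $snoopy = new Snoopy;

    // placeholder credentials and field names; inspect the real form
    $formvars = array(
        "username" => "joe",
        "password" => "bloe",
    );

    // POST the login form; Snoopy keeps the cookies it receives
    // (passcookies is true by default), so the session should carry over
    if($snoopy->submit("https://example.com/login", $formvars)){
        // fetch the page that shows the balance with the same session
        if($snoopy->fetchtext("https://example.com/account")){
            print htmlspecialchars($snoopy->results);
        }
    }
    else {
        print "Snoopy: error while submitting the form: ".$snoopy->error."\n";
    }
?>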
零度° 2024-08-13 12:17:55

Use VietSpider Web Data Extractor.

VietSpider Web Data Extractor is software that crawls data from websites (a data scraper), formats it as standard XML (Text, CDATA), and stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, Postgres... The VietSpider Crawler supports sessions (login, querying via form input), multiple downloads, JavaScript handling, and proxies (including multiple proxies by auto-scanning proxy lists from websites).

Download from http://binhgiang.sourceforge.net
