How do I scrape data from websites?

Posted 2024-08-06 12:17:55

So, I often check my accounts for different numbers. For example, my affiliate accounts: I check whether the cash has increased.

I want to write a script that can log in to all these websites, grab the money values for me, and display them on one page. How can I program this?


Comments (6)

叹沉浮 2024-08-13 12:17:55

You should take a look at curl.
With it you should be able to put together a script that retrieves a web page quite easily.

Also take a look at SimpleXML and DOM; they will help you extract information from (X)HTML files.

Zend_Http could also be a good alternative to curl.

Cheers
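For instance, here is a minimal sketch of that approach; the URL is a placeholder, and the page is assumed to be well-formed enough for DOM to parse:

<?php
    // fetch a page with curl; https://example.com/dashboard is a placeholder
    $ch = curl_init("https://example.com/dashboard");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);

    if($html === false){
        die("curl error: ".curl_error($ch));
    }
    curl_close($ch);

    // parse the (X)HTML with DOM; @ silences warnings from sloppy markup
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    // print the page title as a quick smoke test
    print $doc->getElementsByTagName("title")->item(0)->nodeValue."\n";
?>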

三岁铭 2024-08-13 12:17:55

Well, sort of a vague question... I'd suggest the following steps:

  • send the login credentials via POST
  • grab and parse the response
  • do this for all relevant accounts / sites you want to check
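A rough sketch of that flow with curl follows; the URL, form field names, and credentials are all placeholders you would swap for the target site's real login form:

<?php
    // step 1: send the login credentials via POST, keeping the session cookie
    $cookieJar = tempnam(sys_get_temp_dir(), "cookies");

    $ch = curl_init("https://example.com/login"); // placeholder login URL
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        "username" => "joe",    // placeholder credentials
        "password" => "bloe",
    )));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);  // write cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar); // and send them back
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);

    // step 2: grab the page that shows the balance, reusing the session
    curl_setopt($ch, CURLOPT_URL, "https://example.com/account");
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // step 3: parse the response (a regex for brevity; DOM is more robust)
    if(preg_match('/Balance:\s*\$([\d.,]+)/', $html, $m)){
        print "Balance: ".$m[1]."\n";
    }
?>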

If you face specific problems, feel free to comment on this answer.

EDIT: I agree with RageZ on his technical approach. curl would be the 'weapon of choice' for me too... ^^

hth

K

狼性发作 2024-08-13 12:17:55

First of all, check whether the services you want to log in to have APIs.
That would be much easier, since an API is a format made specifically for getting at the data and using it in another application.

If there is an API, you can look at its documentation to see how to retrieve and use the data.

If there isn't one, you need to scrape the HTML pages.
You can start by taking a look at curl: http://php.net/curl
The idea is to simulate your own visit to the website by sending the login POST request and grabbing the data it returns.

After retrieving a page's data, you can parse it with tools like DOM:
http://php.net/dom
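For example, once you have the HTML, an XPath query can pull a specific value out of it; the markup and class name below are invented for illustration:

<?php
    // suppose the fetched page contains: <span class="balance">$123.45</span>
    // (an invented structure; inspect the real page to find the right selector)
    $html = '<html><body><span class="balance">$123.45</span></body></html>';

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences warnings from sloppy real-world HTML

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//span[@class="balance"]');

    if($nodes->length > 0){
        print "Balance: ".trim($nodes->item(0)->nodeValue)."\n";
    }
?>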

烟燃烟灭 2024-08-13 12:17:55

Use TestPlan; it was designed as a web automation system and makes tasks like this very simple.

七婞 2024-08-13 12:17:55

I would really take a look at Snoopy if I were you; it is more user-friendly than curl to use in your PHP scripts. Here is some sample code:

<?php
    /*
    You need the snoopy.class.php from 
    http://snoopy.sourceforge.net/
    */

    include("snoopy.class.php");

    $snoopy = new Snoopy;

    // need a proxy?:
    //$snoopy->proxy_host = "my.proxy.host";
    //$snoopy->proxy_port = "8080";

    // set browser and referer:
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
    $snoopy->referer = "http://www.jonasjohn.de/";

    // set some cookies:
    $snoopy->cookies["SessionID"] = '238472834723489';
    $snoopy->cookies["favoriteColor"] = "blue";

    // set a raw header:
    $snoopy->rawheaders["Pragma"] = "no-cache";

    // set some internal variables:
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;

    // set username and password (optional)
    //$snoopy->user = "joe";
    //$snoopy->pass = "bloe";

    // fetch the text of the website www.google.com:
    if($snoopy->fetchtext("http://www.google.com")){ 
        // other methods: fetch, fetchform, fetchlinks, submittext and submitlinks

        // response code:
        print "response code: ".$snoopy->response_code."<br/>\n";

        // print the headers:

        print "<b>Headers:</b><br/>";
        // each() was removed in PHP 8; iterate with foreach instead
        foreach($snoopy->headers as $key => $val){
            print $key.": ".$val."<br/>\n";
        }

        print "<br/>\n";

        // print the texts of the website:
        print "<pre>".htmlspecialchars($snoopy->results)."</pre>\n";

    }
    else {
        print "Snoopy: error while fetching document: ".$snoopy->error."\n";
    }
?>

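Since the goal is to log in first, here is a hedged sketch using Snoopy's submit() method; the URL, form field names, and credentials are placeholders for the real login form:

<?php
    include("snoopy.class.php");

    $snoopy = new Snoopy;

    // placeholder credentials and field names; inspect the real form
    $formvars = array(
        "username" => "joe",
        "password" => "bloe",
    );

    // POST the login form; Snoopy keeps the cookies it receives
    // (passcookies is true by default), so the session should carry over
    if($snoopy->submit("https://example.com/login", $formvars)){
        // fetch the page that shows the balance with the same session
        if($snoopy->fetchtext("https://example.com/account")){
            print htmlspecialchars($snoopy->results);
        }
    }
    else {
        print "Snoopy: error while submitting the form: ".$snoopy->error."\n";
    }
?>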
零度° 2024-08-13 12:17:55

Use VietSpider Web Data Extractor.

VietSpider Web Data Extractor is software that crawls data from websites (a data scraper), formats it as standard XML (Text, CDATA), and stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, Postgres... The VietSpider Crawler supports sessions (login, querying via form input), multiple downloads, JavaScript handling, and proxies (including multiple proxies by auto-scanning proxy lists from websites).

Download from http://binhgiang.sourceforge.net
