How do I scrape data from websites?
So I often check my accounts for changing numbers. For example, my affiliate accounts: I check for increases in cash.
I want to write a script that can log in to all these websites, grab the money value for me, and display it on one page. How can I program this?
6 Answers
You should take a look at curl.
You should be able to write a script that retrieves a given web page easily.
Also take a look at SimpleXML and DOM; they will help you extract information from (X)HTML files.
Zend_Http could also be a good alternative to curl.
Cheers
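A minimal sketch of the DOM part of this suggestion. The HTML snippet and the `balance` class name are made-up examples; in a real script the `$html` string would come from `curl_exec()`, and the XPath query would have to match the markup of the actual account page.

```php
<?php
// Minimal sketch: extract a money value from fetched (X)HTML with DOM.
// The HTML snippet and the "balance" class name are invented examples --
// adjust the XPath query to match the real page you fetched with curl.

// In practice $html would come from curl_exec(); here we use a stub page.
$html = '<html><body><span class="balance">$123.45</span></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);            // suppress warnings on sloppy real-world HTML
$xpath = new DOMXPath($doc);

$node = $xpath->query('//span[@class="balance"]')->item(0);
$balance = $node !== null ? trim($node->textContent) : null;

echo $balance;                      // the raw text of the balance element
```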
Well, it's sort of a vague question... I'd suggest the following steps:
send the login request
grab and parse the response
do this for all the relevant accounts / sites you want to check
If you face specific problems, feel free to comment on this answer.
EDIT: I agree with RageZ on the technical approach. curl would be the 'weapon of choice' for me too... ^^
hth
K
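The steps above can be sketched with the PHP curl extension roughly like this. The URLs and form field names (`user`, `pass`) are placeholders; every site differs, so you would inspect each real login form to find the right action URL and field names.

```php
<?php
// Sketch of the steps above with the PHP curl extension. The URLs and the
// form field names ("user", "pass") are placeholders -- inspect the real
// login form of each site to find the actual action URL and fields.

function fetch_logged_in_page($loginUrl, $pageUrl, array $credentials)
{
    $cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

    // Step 1: send the login POST request, storing the session cookie.
    $ch = curl_init($loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($credentials));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    curl_close($ch);

    // Step 2: grab the page you actually want, reusing the session cookie.
    $ch = curl_init($pageUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
    $html = curl_exec($ch);
    curl_close($ch);

    return $html;   // parse this with DOM / SimpleXML afterwards
}
```

You would then call this once per account and parse each returned page for the value you care about.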
First of all, check whether the services you want to log in to have APIs.
It would be much easier, as an API is a format made specifically for getting the data and using it in another application.
If there is an API, you can look at its documentation to see how to retrieve and use the data.
If there isn't one, you will need to scrape the HTML pages.
You can start by taking a look at curl: http://php.net/curl
The idea is to simulate your own visit to the website by sending the login POST request and grabbing the data it returns.
After retrieving the page data, you can parse it with tools like DOM.
http://php.net/dom
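To illustrate why an API is the easier route: the response is usually JSON, which decodes straight into a PHP array instead of requiring HTML parsing. This is a sketch; the payload below is invented, and a real API's field names would be given in the service's documentation.

```php
<?php
// Sketch: consuming a JSON API response instead of scraping HTML.
// The payload is invented -- a real API documents its own fields.

// In practice this string would come from a curl request to the API.
$response = '{"account":"affiliate-1","balance":123.45,"currency":"USD"}';

$data = json_decode($response, true);   // decode into an associative array
$balance = $data['balance'];

printf("%s: %.2f %s\n", $data['account'], $balance, $data['currency']);
```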
Use TestPlan; it was designed as a web automation system and makes such tasks very simple.
I would really have a look at Snoopy if I were you; it's more user-friendly than curl to use in your PHP scripts. Here is some sample code.
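The original sample code did not survive the page extraction; the following is a minimal sketch assuming Snoopy's `submit()`/`fetch()` interface. The URLs and form field names are placeholders for your real account pages.

```php
<?php
// Minimal Snoopy sketch (Snoopy.class.php ships with the Snoopy package).
// The URLs and the form field names are placeholders -- replace them with
// the real login form and account page of the site you are checking.
include "Snoopy.class.php";

$snoopy = new Snoopy;

// Log in by submitting the login form; Snoopy keeps the session cookies.
$vars = array("user" => "yourname", "pass" => "secret");
$snoopy->submit("http://example.com/login", $vars);

// Fetch the page that shows your balance; the HTML lands in ->results.
$snoopy->fetch("http://example.com/account");
echo $snoopy->results;
```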
Use VietSpider Web Data Extractor.
VietSpider Web Data Extractor: software that crawls data from websites (Data Scraper), formats it to standard XML (Text, CDATA), and then stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, Postgres... VietSpider Crawler supports sessions (login, query by form input), multiple downloads, JavaScript handling, and proxies (including multiple proxies by auto-scanning proxy lists from websites).
Download from http://binhgiang.sourceforge.net