What's the best way of writing a maintainable web scraping app?
I wrote a Perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After my bank changed their web pages a few times, I got fed up with debugging it to keep it up to date.
So what's the best way of writing such a program so that it's easy to maintain? I'd like to write a nice, well-engineered version in either Perl or Java which will be easy to update when the bank inevitably fiddles with their web site.
Comments (7)
In Perl, something like WWW::Mechanize can already make your script simpler and more robust, because it can find HTML forms in previous responses from the website. You can fill in these forms to prepare a new request.
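The form-discovery idea is not tied to Perl. Here is a minimal sketch of it in Java (the asker's other candidate language), using only the standard library. It is not how WWW::Mechanize is implemented; it assumes the login page is well-formed XHTML, and the page content, field names (`session_token`, `customer_id`, `pin`) and values are all hypothetical. The point is that hidden fields the server adds are picked up from the response instead of being hard-coded in the script.

```java
import java.io.StringReader;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LoginFormSketch {
    // Hypothetical, well-formed login page; a real bank page will differ
    // and may need a lenient HTML parser instead of an XML parser.
    static final String SAMPLE_PAGE =
        "<html><body><form id=\"login\" action=\"/logon\" method=\"post\">"
        + "<input type=\"hidden\" name=\"session_token\" value=\"abc123\"/>"
        + "<input type=\"text\" name=\"customer_id\" value=\"\"/>"
        + "<input type=\"password\" name=\"pin\" value=\"\"/>"
        + "</form></body></html>";

    // Collect every named <input> of the first form, keeping the
    // default values the server supplied (including hidden fields).
    static Map<String, String> readFormFields(String page) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(page)));
        NodeList inputs = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("(//form)[1]//input[@name]", doc, XPathConstants.NODESET);
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < inputs.getLength(); i++) {
            Element input = (Element) inputs.item(i);
            fields.put(input.getAttribute("name"), input.getAttribute("value"));
        }
        return fields;
    }

    // Build an application/x-www-form-urlencoded request body.
    static String encode(Map<String, String> fields) throws Exception {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), "UTF-8") + "="
                + URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fields = readFormFields(SAMPLE_PAGE);
        fields.put("customer_id", "1234567890"); // placeholder credentials
        fields.put("pin", "0000");
        System.out.println(encode(fields));
    }
}
```

If the bank later adds or renames a hidden field, the script keeps working as long as the fields you fill in yourself still exist.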
WWW::Mechanize and Web::Scraper are the two tools that make me most productive when used together. There's a nice article about that combination at catalyzed.org.
If I were to give you one piece of advice, it would be to use XPath for all your scraping needs. Avoid regexes.
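To illustrate the XPath advice, here is a minimal Java sketch using only the standard `javax.xml.xpath` API. The sample markup, the `balance` class name and the XPath expression are assumptions made up for the example, and the page is assumed to be well-formed XHTML. Keeping each page-specific expression in a single constant means a layout change usually needs a one-line fix, and failing loudly on an empty match tells you which selector broke.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class BalanceXPathSketch {
    // Hypothetical selector: the only thing to edit when the markup changes.
    static final String BALANCE_XPATH = "//td[@class='balance']";

    // Hypothetical account page fragment.
    static final String SAMPLE_PAGE =
        "<html><body><table><tr>"
        + "<td class=\"label\">Balance</td>"
        + "<td class=\"balance\"> 1,234.56 </td>"
        + "</tr></table></body></html>";

    // Return the trimmed text of the first node matching the expression,
    // or fail with a message naming the selector that no longer matches.
    static String extract(String page, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(page)));
        String value = XPathFactory.newInstance().newXPath()
            .evaluate(xpath, doc).trim();
        if (value.isEmpty()) {
            throw new IllegalStateException(
                "No match for " + xpath + "; the page layout may have changed");
        }
        return value;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(SAMPLE_PAGE, BALANCE_XPATH));
    }
}
```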
Hmm, just found Finance::Bank::Natwest, which is a Perl module specifically for my bank! I wasn't expecting it to be quite that easy.
A lot of banks publish their data in a standard format, which is commonly used by personal finance packages such as MS Money or Quicken to download transaction information. You could look for that hook and download using the same API, and then parse the data on your end (e.g. parse Excel documents with Spreadsheet::ParseExcel, and Quicken documents with Finance::QIF).
Edit (reply to comment): Have you considered contacting your bank and asking them how you can programmatically log into your account in order to download the financial data? Many banks have an API for this (which Quicken etc. make use of, as described above).
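If the bank does offer a QIF download, parsing it is far less fragile than scraping HTML. As a rough illustration, here is a minimal Java sketch of a QIF reader. It is not a replacement for Finance::QIF; it assumes the common layout of one field per line, where the first character is a field code (`D` date, `T` amount, `P` payee), `^` ends a record and `!` starts a header line. The sample data is made up, and real exports may use other codes and date formats.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class QifSketch {
    // Made-up sample export with two transactions.
    static final String SAMPLE =
        "!Type:Bank\n"
        + "D01/02/2009\nT-25.00\nPSupermarket\n^\n"
        + "D02/02/2009\nT1500.00\nPSalary\n^\n";

    // Parse QIF text into one map per transaction, keyed by field code.
    static List<Map<Character, String>> parse(String qif) {
        List<Map<Character, String>> records = new ArrayList<>();
        Map<Character, String> current = new LinkedHashMap<>();
        for (String line : qif.split("\\r?\\n")) {
            if (line.isEmpty() || line.startsWith("!")) {
                continue; // skip blank lines and headers such as !Type:Bank
            }
            char code = line.charAt(0);
            if (code == '^') {
                // End of record: store it and start a new one.
                if (!current.isEmpty()) {
                    records.add(current);
                }
                current = new LinkedHashMap<>();
            } else {
                current.put(code, line.substring(1).trim());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map<Character, String> record : parse(SAMPLE)) {
            System.out.println(
                record.get('D') + " " + record.get('T') + " " + record.get('P'));
        }
    }
}
```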
There's a currently up-to-date Ruby implementation here:
http://github.com/warm/NatWoogle
Use Perl and the Web::Scraper package.