What's the best way to write a maintainable web-scraping program?

Published 2024-08-10 18:14:24

I wrote a Perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After a few instances of my bank changing their webpage, I got fed up with debugging it to keep it up to date.

So what's the best way of writing such a program so that it's easy to maintain? I'd like to write a nice, well-engineered version in either Perl or Java which will be easy to update when the bank inevitably fiddles with its web site.


Comments (7)

滥情空心 2024-08-17 18:14:24

In Perl, something like WWW::Mechanize can already make your script simpler and more robust, because it can find the HTML forms in the website's previous responses. You can fill in these forms to prepare a new request. For example:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get($url);

# Fill in the password field of the first form on the page and submit it
$mech->submit_form(
    form_number => 1,
    fields      => { password => $password },
);
die "Login failed: " . $mech->status unless $mech->success;

木森分化 2024-08-17 18:14:24

The combination of WWW::Mechanize and Web::Scraper is what makes me most productive. There's a nice article about that combination on catalyzed.org.
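To illustrate the idea, here is a minimal Web::Scraper sketch. The markup (a `span.balance` element) is a made-up stand-in for whatever the bank actually serves; the point is that you declare *what* to extract, so when the page changes only the selectors need updating:

use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup standing in for the bank's statement page;
# the real selector depends on the bank's actual HTML.
my $html = '<html><body><span class="balance">1234.56</span></body></html>';

# Declare what to extract rather than how to walk the HTML tree
my $statement = scraper {
    process 'span.balance', balance => 'TEXT';
};

my $res = $statement->scrape($html);
print "Balance: $res->{balance}\n";

In a real script, `$html` would come from `$mech->content` after logging in with WWW::Mechanize.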

风筝在阴天搁浅。 2024-08-17 18:14:24

If I were to give you one piece of advice, it would be to use XPath for all your scraping needs. Avoid regexes.
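As a sketch of what that looks like in Perl, XML::LibXML can parse real-world HTML (with `recover => 1` to tolerate tag soup) and run XPath queries against it. The table markup below is hypothetical; the benefit is that each piece of data is located by a single XPath expression kept in one place:

use strict;
use warnings;
use XML::LibXML;

# Hypothetical statement-page markup; recover => 1 lets libxml
# cope with the imperfect HTML real sites tend to serve.
my $html = '<html><body><table id="acct">'
         . '<tr><td>Balance</td><td>1234.56</td></tr>'
         . '</table></body></html>';
my $dom = XML::LibXML->load_html(string => $html, recover => 1);

# One XPath expression per piece of data you need
my ($balance) = $dom->findnodes('//table[@id="acct"]//td[2]');
print $balance->textContent, "\n";

When the bank reshuffles its layout, you fix one expression instead of untangling a regex.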

美胚控场 2024-08-17 18:14:24

Hmm, just found

Finance::Bank::Natwest

Which is a Perl module specifically for my bank! Wasn't expecting it to be quite that easy.

嘴硬脾气大 2024-08-17 18:14:24

A lot of banks publish their data in a standard format, which is commonly used by personal finance packages such as MS Money or Quicken to download transaction information. You could look for that hook and download using the same API, and then parse the data on your end (e.g. parse Excel documents with Spreadsheet::ParseExcel, and Quicken docs with Finance::QIF).
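For the QIF case, a short sketch with Finance::QIF might look like the following. The sample file is written inline so the snippet is self-contained; a real run would read the statement downloaded from the bank instead, and the record keys used here (`header`, `date`, `transaction`, `payee`) follow the Finance::QIF documentation:

use strict;
use warnings;
use Finance::QIF;

# Write a tiny sample QIF file so the sketch is self-contained
my $sample = "!Type:Bank\nD10/08/2024\nT-42.00\nPCoffee Shop\n^\n";
open my $fh, '>', 'statement.qif' or die $!;
print {$fh} $sample;
close $fh;

my $qif = Finance::QIF->new(file => 'statement.qif');
while ( my $record = $qif->next ) {
    next unless $record->{header} eq 'Type:Bank';
    # The T (amount) field is exposed under the 'transaction' key
    print "$record->{date}  $record->{transaction}  $record->{payee}\n";
}

Once the bank's export format is the interface, the scraper only has to survive changes to the login flow, not to the statement pages.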

Edit (reply to comment): Have you considered contacting your bank and asking them how you can programmatically log into your account in order to download the financial data? Many/most banks have an API for this (which Quicken etc make use of, as described above).

眼藏柔 2024-08-17 18:14:24

There's an up-to-date Ruby implementation here:

http://github.com/warm/NatWoogle

中性美 2024-08-17 18:14:24


Use Perl and the Web::Scraper package:
link text
