Screen scraping form results
A client recently asked me to build a website for their insurance business. As part of this, they want to do some screen scraping of the quote site for one of their providers. They asked whether there was an API for this and were told there wasn't one, but that if they could get the data from the provider's engine they could use it as they wanted.
My question: is it even possible to perform screen scraping on the response to a form submission to another site? If so, what are the gotchas I should look out for? Obvious legal/ethical issues aside, since they have already asked for permission to do what we're planning to do.
As an aside, I would prefer to do any processing in python.
Thanks
You can pass a data parameter to urllib.urlopen to send POST data with the request, just as if you had filled out the form. You'll obviously have to take a look at exactly what data the form contains.
Also, if the form has method="GET", the request data is just part of the URL given to urlopen.
Pretty much the standard for scraping the returned HTML data is BeautifulSoup.
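As a minimal sketch of the POST/GET distinction described above, assuming Python 3 (where urlopen lives in urllib.request rather than urllib): the helper names build_form_request and submit_form, the example.com URL, and the field names are all hypothetical, and the real field names have to be read from the provider's form.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_request(action_url, fields, method="POST"):
    """Build a request that mimics submitting an HTML form.

    action_url: the form's action attribute (hypothetical here).
    fields: mapping of form input names to values.
    """
    encoded = urlencode(fields)
    if method.upper() == "GET":
        # For method="GET" the form data is simply part of the URL.
        separator = "&" if "?" in action_url else "?"
        return Request(action_url + separator + encoded)
    # Supplying data makes urllib send a POST with a form-encoded body.
    return Request(action_url, data=encoded.encode("ascii"))


def submit_form(action_url, fields, method="POST", timeout=30):
    """Send the form and return the response body as text."""
    request = build_form_request(action_url, fields, method)
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Placeholder URL and field names, for illustration only.
    req = build_form_request("http://example.com/quote", {"age": "30"})
    print(req.get_method(), req.full_url, req.data)
```

The returned text from submit_form is what you would then hand to an HTML parser. Hidden inputs in the form usually need to be included in fields as well.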
I see the other two answers already mention all the major libraries of choice for the purpose... as long as the site being scraped does not make extensive use of Javascript, that is. If it IS a Javascript-heavy site that depends on JS for the data it fetches and displays (e.g. via AJAX), your problem is an order of magnitude harder; in that case, I might suggest starting with crowbar, some customization of diggstripper, or selenium, etc.
You'll have to do substantial work in Javascript and probably dedicated work to deal with the specifics of the (hypothetically JS-heavy) site in question, depending on the JS frameworks it uses, etc; that's why the job is so much harder if that is the case. But in any case you might end up with (at least in part) local HTML copies of the site's pages as displayed, and end by scraping those copies with the other tools already recommended. Good luck: may the sites you scrape always be Javascript-light!-)
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
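For a rough idea of what scraping with lxml looks like, here is a small sketch, assuming lxml is installed. The sample markup, the table id "quotes", and the helper name extract_rows are made up for illustration; the real page structure will differ.

```python
import lxml.html

# Deliberately sloppy markup: some <tr> and <td> tags are never closed.
SAMPLE_HTML = """
<html><body>
<table id="quotes">
  <tr><th>Plan</th><th>Premium</th></tr>
  <tr><td>Basic</td><td>$100</td>
  <tr><td>Plus<td>$150
</table>
</body></html>
"""


def extract_rows(html, table_id):
    """Return the cell text of each row in the table with the given id."""
    root = lxml.html.fromstring(html)
    rows = []
    # $tid is an XPath variable filled in from the keyword argument.
    for tr in root.xpath("//table[@id=$tid]//tr", tid=table_id):
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        if cells:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML, "quotes"):
        print(row)
```

XPath is used here for selection; lxml also offers CSS selectors through a separate package if that reads more naturally to you.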
A really nice library for screen scraping is mechanize, which I believe is a clone of an original library written in Perl. Combine it with the ClientForm module and some additional help from BeautifulSoup, and you should be away.
I've written loads of screen-scraping code in Python and these modules turned out to be the most useful. Most of what mechanize does could in theory be done by simply using the urllib2 or httplib modules from the standard library, but mechanize makes this stuff a breeze: essentially it gives you a programmatic browser (note that it does not require a browser to work, but merely provides you with an API that behaves like a completely customisable browser).
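To illustrate the "could in theory be done with the standard library" point, here is a sketch of a bare-bones browser-like session built only from standard modules, assuming Python 3 (where urllib2 became urllib.request). This is not mechanize's API; the helper name make_session and the User-Agent string are hypothetical.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session(user_agent="quote-scraper-sketch/0.1"):
    """Return an opener that keeps cookies between requests, plus its jar.

    Keeping cookies is what lets a multi-step form flow (load the page,
    then submit) look like one browser session to the remote site.
    """
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Replace urllib's default User-Agent header with our own.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


if __name__ == "__main__":
    opener, jar = make_session()
    print(len(jar), opener.addheaders)
```

Requests then go through opener.open(url) or opener.open(url, data=encoded_bytes) instead of urlopen. What mechanize adds on top of this is the convenience layer: finding forms, filling fields by name, and following links.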
For post-processing, I've had a lot of success with BeautifulSoup, but lxml.html is a good choice too.
Basically, you will be able to do this in Python for sure, and your results should be really good with the range of tools out there.