Perl: Scraping HTML from authenticated websites
While HTML scraping is pretty well documented from what I can see, and I understand the concept and implementation of it, what is the best method for scraping content that is tucked away behind authentication forms? I refer to scraping content that I legitimately have access to, so a method for automatically submitting login data is what I'm looking for.
All I can think of is setting up a proxy, capturing the throughput from a manual login, then setting up a script to spoof that throughput as part of the HTML scraping execution. As far as language goes, it would likely be done in Perl.
Has anyone had experience with this, or just a general thought?
Edit
This has been answered before, but with .NET. While it validates how I think it should be done, does anyone have a Perl script to do this?
Comments (4)
Check out the Perl WWW::Mechanize library - it builds on LWP to provide tools for doing exactly the kind of interaction you refer to, and it can maintain state with cookies while you're about it!
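As a rough illustration of that suggestion, here is a minimal sketch of a form login with WWW::Mechanize. The URLs and the form field names (`username`, `password`) are placeholders made up for this example, not taken from any real site; substitute the values from the login page you actually have access to.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical values: replace with the real login URL, form field names,
# and protected page for the site you are allowed to access.
my %site = (
    login_url  => 'https://example.com/login',
    user_field => 'username',
    pass_field => 'password',
    page_url   => 'https://example.com/members/report',
);

# Log in through the site's form and return the HTML of a protected page.
sub fetch_protected_page {
    my ($mech, $site, $user, $pass) = @_;

    $mech->get($site->{login_url});

    # with_fields selects the first form that contains all the named fields,
    # fills them in, and submits it. Hidden fields already in the form are
    # sent along unchanged.
    $mech->submit_form(
        with_fields => {
            $site->{user_field} => $user,
            $site->{pass_field} => $pass,
        },
    );

    # The session cookie set by the login is replayed automatically here.
    $mech->get($site->{page_url});
    return $mech->content;
}

# autocheck => 1 makes Mechanize die on any failed request.
my $mech = WWW::Mechanize->new(autocheck => 1);

# Only touch the network when a user name and password are supplied.
if (@ARGV == 2) {
    print fetch_protected_page($mech, \%site, @ARGV);
}
```

Run it as `perl scrape.pl USER PASSWORD` (the script name is arbitrary). Because Mechanize parses the login page before submitting, this approach tends to cope with hidden form fields without extra work.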
The LWP module in Perl should give you what you're after.
There's a good article here which talks about enabling cookies and other authentication methods so that your script can log in with your authorised account and scrape the pages behind the login wall.
There are two types of authentication in regular use: HTTP-based authentication and form-based authentication.
For a site that uses HTTP-based authentication, you basically send the username and password as part of each HTTP request you make to the server.
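A minimal sketch of the HTTP-based case, assuming the site uses the Basic scheme and that LWP is installed. The helper name `basic_auth_request` is made up for this example. Basic credentials are only base64-encoded, not encrypted, so this is best used over HTTPS.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a GET request that carries HTTP Basic credentials.
# (Hypothetical helper, named for this example only.)
sub basic_auth_request {
    my ($url, $user, $pass) = @_;
    my $req = HTTP::Request->new(GET => $url);
    # Sets the header "Authorization: Basic <base64 of user:pass>"
    $req->authorization_basic($user, $pass);
    return $req;
}

# Only touch the network when a URL, user name, and password are supplied.
if (@ARGV == 3) {
    my ($url, $user, $pass) = @ARGV;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->request(basic_auth_request($url, $user, $pass));
    die 'Request failed: ' . $res->status_line . "\n" unless $res->is_success;
    print $res->decoded_content;
}
```

Every request to the protected area needs the same header, so in a longer script you would build each request through the helper rather than authenticating once.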
For a site that does form-based authentication, you usually need to visit the login page, submit the login form with your credentials, accept and store the cookie the server sends back, then submit that cookie with any further HTTP requests you make.
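The form-based flow could look roughly like this with plain LWP and a cookie jar. The URLs and field names in the usage section are placeholders, and the sketch assumes the login form needs nothing beyond a user name and password; a site that adds hidden tokens to its form would need those fields read from the login page first.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Create a user agent whose in-memory cookie jar persists across requests.
sub new_session_agent {
    my $ua = LWP::UserAgent->new(cookie_jar => HTTP::Cookies->new);
    # Many sites answer a successful login POST with a redirect; follow it.
    push @{ $ua->requests_redirectable }, 'POST';
    return $ua;
}

# Visit the login page, post the credentials, then fetch a protected page.
# Expects login_url, page_url, and form (a hash ref of field => value).
sub login_and_fetch {
    my ($ua, %arg) = @_;

    $ua->get($arg{login_url});                       # pick up any pre-login cookie
    my $login = $ua->post($arg{login_url}, $arg{form});
    die 'Login failed: ' . $login->status_line . "\n" unless $login->is_success;

    my $page = $ua->get($arg{page_url});             # stored cookies are sent here
    die 'Fetch failed: ' . $page->status_line . "\n" unless $page->is_success;
    return $page->decoded_content;
}

# Only touch the network when a user name and password are supplied.
if (@ARGV == 2) {
    my ($user, $pass) = @ARGV;
    print login_and_fetch(
        new_session_agent(),
        login_url => 'https://example.com/login',             # placeholder
        page_url  => 'https://example.com/members/report',    # placeholder
        form      => { username => $user, password => $pass },  # placeholder names
    );
}
```

A 200 response to the login POST does not prove the login worked, since many sites return the login page again with an error message, so checking the fetched page for expected content is a sensible extra step.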
Of course, there are also sites like Stack Overflow that use external authentication such as OpenID or SAML. These are more complex to deal with when scraping. Usually you want to find a library to handle them.
Yes, you can use other libraries for your own language if it is something other than ASP.NET.
For example, in Java you can use HttpClient or HttpUnit (which even handles some basic JavaScript).