Perl: Scraping HTML from a website that requires authentication

Posted on 2024-07-07 14:52:26

While HTML scraping is pretty well documented from what I can see, and I understand the concept and implementation of it, what is the best method for scraping content that is tucked away behind authentication forms? I'm referring to content that I legitimately have access to, so what I'm looking for is a method for automatically submitting login data.

All I can think of is setting up a proxy, capturing the traffic from a manual login, then setting up a script to replay that traffic as part of the HTML scraping run. As far as language goes, it would likely be done in Perl.

Has anyone had experience with this, or just a general thought?

Edit
This has been answered before, but with .NET. While it validates how I think it should be done, does anyone have a Perl script to do this?

Comments (4)

叹倦 2024-07-14 14:52:26

Check out the Perl WWW::Mechanize library - it builds on LWP to provide tools for doing exactly the kind of interaction you refer to, and it can maintain state with cookies while you're about it!

WWW::Mechanize, or Mech for short, helps you automate interaction with a website. It supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited.
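
A minimal sketch of that flow with WWW::Mechanize, assuming a conventional login form. The sub name login_and_fetch, the URLs, and the field names are placeholders and not taken from any real site; check the login page's HTML for the actual field names.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Log in through an HTML form, then fetch a page that requires the session.
# All keys in %args are hypothetical names chosen for this sketch.
sub login_and_fetch {
    my (%args) = @_;

    # Mech keeps an in-memory cookie jar by default, so the session cookie
    # set by the login response is sent with later requests.
    # autocheck => 1 makes failed requests die instead of failing silently.
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    $mech->get( $args{login_url} );

    # Pick the form that contains these fields, fill them in, and submit it.
    $mech->submit_form(
        with_fields => {
            $args{user_field} => $args{username},
            $args{pass_field} => $args{password},
        },
    );

    # Now request the protected page with the authenticated session.
    $mech->get( $args{target_url} );
    return $mech->content;
}

# Example call (every value here is a placeholder):
# my $html = login_and_fetch(
#     login_url  => 'https://example.com/login',
#     target_url => 'https://example.com/members/report',
#     user_field => 'username',
#     pass_field => 'password',
#     username   => 'me',
#     password   => 'secret',
# );
```

Selecting the form by its fields (with_fields) avoids having to know the form's name or position on the page, which tends to be more robust when the page layout changes.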

一场信仰旅途 2024-07-14 14:52:26

The LWP module in Perl should give you what you're after.

There's a good article here which talks about enabling cookies and other authentication methods, so that your scraper gets an authorised login and can reach the content behind the log-in wall.

帅哥哥的热头脑 2024-07-14 14:52:26

There are two types of authentication in regular use: HTTP-based authentication and form-based authentication.

For a site that uses HTTP-based authentication, you basically send the username and password as part of each HTTP request you make to the server.
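
As a rough illustration of the HTTP-based case, here is a sketch using LWP and HTTP Basic authentication. The sub names basic_auth_request and fetch_with_basic_auth are made up for this example, and it assumes the site uses Basic rather than Digest authentication.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a GET request that carries HTTP Basic credentials.
sub basic_auth_request {
    my ( $url, $user, $pass ) = @_;
    my $req = HTTP::Request->new( GET => $url );
    # Sets the "Authorization: Basic ..." header on this request.
    $req->authorization_basic( $user, $pass );
    return $req;
}

# Send the request and return the page body, or die with the HTTP status.
sub fetch_with_basic_auth {
    my ( $url, $user, $pass ) = @_;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->request( basic_auth_request( $url, $user, $pass ) );
    die 'Request failed: ' . $res->status_line unless $res->is_success;
    return $res->decoded_content;
}

# Example call (placeholder URL and credentials):
# print fetch_with_basic_auth( 'https://example.com/private', 'me', 'secret' );
```

Because the credentials travel with every request, no cookie handling is needed here. For servers that answer with an authentication challenge first, LWP::UserAgent also has a credentials method that registers a username and password for a given host and realm.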

For a site that does form-based authentication, you usually need to visit the login page, submit your credentials, accept and store the cookie the server sends back, then send that cookie with any HTTP requests you make afterwards.
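
The form-based case could look roughly like this with plain LWP and a cookie jar. The sub names, the argument keys, and the optional cookie file are assumptions made for this sketch; the real form field names depend on the site.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Create a user agent that stores cookies. If a file name is given, the
# cookies are also saved to disk so a later run can reuse the session.
sub new_session_agent {
    my ($cookie_file) = @_;
    my %opts = defined $cookie_file
        ? ( file => $cookie_file, autosave => 1, ignore_discard => 1 )
        : ();
    my $jar = HTTP::Cookies->new(%opts);
    return LWP::UserAgent->new( cookie_jar => $jar );
}

# POST the login form, then fetch a protected page with the same agent.
# $args{form_fields} is a hash ref of field name => value (placeholders).
sub form_login_and_fetch {
    my ( $ua, %args ) = @_;

    my $login = $ua->post( $args{login_url}, $args{form_fields} );
    # Many sites answer a successful login with a redirect, which LWP does
    # not follow for POST by default, so accept a redirect status here too.
    die 'Login failed: ' . $login->status_line
        unless $login->is_success || $login->is_redirect;

    # The cookie jar now holds the session cookie and sends it automatically.
    my $page = $ua->get( $args{target_url} );
    die 'Fetch failed: ' . $page->status_line unless $page->is_success;
    return $page->decoded_content;
}

# Example call (placeholder URLs and field names):
# my $ua   = new_session_agent('cookies.txt');
# my $html = form_login_and_fetch(
#     $ua,
#     login_url   => 'https://example.com/login',
#     target_url  => 'https://example.com/members/report',
#     form_fields => { username => 'me', password => 'secret' },
# );
```

Note that a 2xx or redirect status does not prove the login worked; some sites return 200 with an error message, so checking the fetched page for expected content is a sensible extra step.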

Of course, there are also sites like Stack Overflow that use external authentication such as OpenID or SAML. These are more complex to deal with when scraping. Usually you want to find a library to handle them.

落花随流水 2024-07-14 14:52:26

Yes, you can use other libraries for your own language if it is something other than ASP.NET.

For example, in Java you can use HttpClient or HttpUnit (which even handles some basic JavaScript).
