How do I use a web client with secure sites?
I need to automate a process involving a website that is using a login form. I need to capture some data in the pages following the login page.
I know how to screen-scrape normal pages, but not those behind a secure site.
- Can this be done with the .NET WebClient class?
- How would I automatically login?
- How would I keep logged in for the other pages?
4 Answers
Can you please clarify? Is the WebClient class you speak of the one in HTTPUnit/Java?
If so, your session should be saved automatically.
One way would be through automating a browser -- you mentioned WebClient, so I'm guessing you might be referring to WebClient in .NET.
Two main points:
Here are the steps I'd follow:
On step 2, I mention a somewhat complicated method for automating the login. Usually, you can post with username and password directly to the known login form action without getting the initial form or relaying the hidden fields. Some sites have form validation (different from field validation) on their forms which makes this method not work.
HtmlAgilityPack is a .NET library that allows you to turn ill-formed html into an XmlDocument so you can XPath over it. Quite useful.
Finally, you may run into a situation where the form relies on client script to alter the form values before submitting. You may need to simulate this behavior.
Using a tool to view the http traffic for this type of work is extremely helpful - I recommend ieHttpHeaders, Fiddler, or FireBug (net tab).
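The flow described above can be sketched in code. The question is about the .NET WebClient, but as a language-neutral illustration here is a minimal Python version of the same idea: fetch the login form, relay its hidden fields (the __VIEWSTATE-style values mentioned in step 2), POST the credentials, and reuse the cookie jar for the pages behind the login. The URL and field names are placeholders, not anything from the original question.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects <input type="hidden"> name/value pairs from a login page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

def extract_hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields

def make_session():
    """An opener that keeps cookies between requests, the way a browser does."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def login(opener, login_url, username, password):
    # Step 1: GET the login page and relay any hidden fields it carries.
    form_html = opener.open(login_url).read().decode("utf-8", "replace")
    data = extract_hidden_fields(form_html)
    # Field names vary per site -- inspect the real form first.
    data.update({"Username": username, "Password": password})
    # Step 2: POST the credentials; the session cookie lands in the jar.
    body = urllib.parse.urlencode(data).encode("ascii")
    return opener.open(login_url, data=body)

# Usage (placeholder URLs -- adjust to the real form action and fields):
# session = make_session()
# login(session, "https://example.com/login", "me", "secret")
# page = session.open("https://example.com/protected").read()
```

The key design point is that one cookie-aware opener is reused for every request, so the session established at login carries over to the scraped pages.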
You can easily simulate user input: your program can submit a form on a web page by sending a POST/GET request to the website.
A typical login form looks like this:
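(The original snippet was not preserved; this is a generic illustration, with made-up field names.)

```html
<form action="/login" method="post">
  <input type="text" name="Username">
  <input type="password" name="Password">
  <input type="submit" value="Log in">
</form>
```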
You can send a POST request to the website providing values for the Username and Password fields. What happens after you send the request depends largely on the website; usually you are redirected to some page, and your authorization info is stored in a session/cookie. So if your scraping client can maintain a web session (that is, it understands cookies), you will be able to access the protected pages.
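As a compact sketch of that POST-then-reuse-cookies pattern (Python here purely for illustration; the URL and field names are placeholders):

```python
import http.cookiejar
import urllib.parse
import urllib.request

# A cookie-aware opener: the session cookie set by the login response
# is automatically sent back on every later request through this opener.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

credentials = urllib.parse.urlencode(
    {"Username": "me", "Password": "secret"}).encode("ascii")

# POST the login form, then reuse the same opener for protected pages:
# opener.open("https://example.com/login", data=credentials)
# html = opener.open("https://example.com/members").read()
```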
It's not clear from your question which language/framework you're going to use. For example, there is a screen-scraping framework (including login functionality) written in Perl: WWW::Mechanize.
Note that you may face some problems if the site you're trying to log in to uses JavaScript or some kind of CAPTCHA.
It isn't clear from your question which WebClient class (or language) you are referring to.
If you have a Java Runtime you can use the Apache HttpClient class; here's an example I wrote using Groovy that accesses the delicious API over SSL:
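The Groovy snippet itself did not survive the page extraction. Purely to illustrate the same pattern it described (HTTP Basic authentication over SSL), here is a sketch in Python rather than Groovy; the del.icio.us endpoint shown is the historical v1 API and is used only as a placeholder:

```python
import base64
import urllib.request

def basic_auth_request(url, username, password):
    """Build a request carrying HTTP Basic credentials; for https:// URLs
    urllib verifies the server's SSL certificate by default."""
    req = urllib.request.Request(url)
    token = base64.b64encode(
        f"{username}:{password}".encode("ascii")).decode("ascii")
    req.add_header("Authorization", "Basic " + token)
    return req

# Usage against the (historical) delicious v1 API -- endpoint is illustrative:
# req = basic_auth_request("https://api.del.icio.us/v1/posts/recent", "user", "pw")
# xml = urllib.request.urlopen(req).read()
```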