How do I use WebClient with a secure site?

Posted 2024-07-05 16:17:31

I need to automate a process involving a website that is using a login form. I need to capture some data in the pages following the login page.

I know how to screen-scrape normal pages, but not those behind a secure site.

  1. Can this be done with the .NET WebClient class?
    • How would I automatically log in?
    • How would I stay logged in for the other pages?

Comments (4)

聚集的泪 2024-07-12 16:17:31

Can you please clarify? Is the WebClient class you speak of the one in HTTPUnit/Java?

If so, your session should be saved automatically.

深巷少女 2024-07-12 16:17:31

One way would be through automating a browser -- you mentioned WebClient, so I'm guessing you might be referring to WebClient in .NET.

Two main points:

  • There's nothing special about HTTPS as far as WebClient is concerned - it just works
  • Cookies are typically used to carry authentication -- you'll need to capture and replay them

Here are the steps I'd follow (a code sketch follows the list):

  1. GET the login form and capture the cookie in the response.
  2. Using XPath and HtmlAgilityPack, find the "input type=hidden" field names and values.
  3. POST to the login form's action with the user name, password, and hidden field values in the request body. Include the cookie in the request headers. Again, capture the cookie in the response.
  4. GET the pages you want, again, with the cookie in the request headers.
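
For concreteness, here's a rough C# sketch of that flow (not a drop-in solution). It uses HttpWebRequest with a shared CookieContainer rather than WebClient, because a bare WebClient doesn't persist cookies between requests (the usual workaround is a WebClient subclass that attaches a CookieContainer in GetWebRequest). The URLs, field names, and credentials are made up; substitute the target site's, and append any hidden fields found in step 2 to the POST body:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class LoginScrapeSketch
    {
        static void Main()
        {
            // One CookieContainer shared by every request: it captures Set-Cookie
            // headers from responses and replays them on later requests.
            var cookies = new CookieContainer();

            // Step 1: GET the login form; the session cookie lands in the container.
            string loginHtml = Fetch("https://example.com/login", cookies);
            // Step 2 would parse hidden fields out of loginHtml here.

            // Step 3: POST the credentials (plus hidden fields) to the form's action.
            byte[] body = Encoding.UTF8.GetBytes("Username=me&Password=secret");
            var post = (HttpWebRequest)WebRequest.Create("https://example.com/login");
            post.Method = "POST";
            post.ContentType = "application/x-www-form-urlencoded";
            post.CookieContainer = cookies;          // replays the session cookie
            post.ContentLength = body.Length;
            using (Stream s = post.GetRequestStream())
                s.Write(body, 0, body.Length);
            using (post.GetResponse()) { }           // auth cookie is captured here

            // Step 4: GET a protected page with the same cookies.
            Console.WriteLine(Fetch("https://example.com/protected/report", cookies));
        }

        static string Fetch(string url, CookieContainer cookies)
        {
            var req = (HttpWebRequest)WebRequest.Create(url);
            req.CookieContainer = cookies;
            using (var resp = (HttpWebResponse)req.GetResponse())
            using (var reader = new StreamReader(resp.GetResponseStream()))
                return reader.ReadToEnd();
        }
    }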

In step 2 I describe a somewhat complicated way of automating the login. Usually you can POST the username and password directly to the known login-form action without fetching the initial form or relaying the hidden fields. Some sites, however, apply form-level validation (distinct from field validation) that makes that shortcut fail.

HtmlAgilityPack is a .NET library that lets you parse ill-formed HTML into a document you can run XPath queries over. Quite useful.
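
As an illustration of step 2, a small helper along these lines could collect the hidden fields so they can be echoed back in the login POST (assumes the HtmlAgilityPack package is referenced; the class and method names are mine):

    using System.Collections.Specialized;
    using HtmlAgilityPack;

    static class HiddenFieldScraper
    {
        // Collects the name/value of every <input type="hidden"> on the page.
        public static NameValueCollection GetHiddenFields(string loginHtml)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(loginHtml);          // tolerant of ill-formed markup

            var fields = new NameValueCollection();
            var nodes = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
            if (nodes != null)                // SelectNodes returns null when nothing matches
            {
                foreach (HtmlNode node in nodes)
                    fields.Add(node.GetAttributeValue("name", ""),
                               node.GetAttributeValue("value", ""));
            }
            return fields;
        }
    }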

Finally, you may run into a situation where the form relies on client script to alter the form values before submitting. You may need to simulate this behavior.

Using a tool to view the http traffic for this type of work is extremely helpful - I recommend ieHttpHeaders, Fiddler, or FireBug (net tab).

惟欲睡 2024-07-12 16:17:31

You can easily simulate user input: you can submit a form on a web page from your program by sending a POST or GET request to the website.
A typical login form looks like this:

<form name="loginForm" method="post" action="target_page.html">
   <input type="text" name="Username">
   <input type="password" name="Password">
</form>

You can send a POST request to the website providing values for the Username and Password fields. What happens after you send the request largely depends on the website; usually you will be redirected to some page. Your authorization info will be stored in the session/cookie, so if your scraping client can maintain a web session and understands cookies, you will be able to access the protected pages.
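
For example, with the .NET WebClient class the question mentions, posting the sample form above might look like the following sketch (the URL is hypothetical; note that a bare WebClient does not remember cookies between calls, so for the follow-up pages you would still need to capture and resend the session cookie, e.g. via an HttpWebRequest/CookieContainer approach):

    using System;
    using System.Collections.Specialized;
    using System.Net;
    using System.Text;

    class FormPostSketch
    {
        static void Main()
        {
            // Field names match the sample form above.
            var form = new NameValueCollection
            {
                { "Username", "me" },
                { "Password", "secret" }
            };

            using (var client = new WebClient())
            {
                // UploadValues sends an application/x-www-form-urlencoded POST
                // and returns the response body as raw bytes.
                byte[] response = client.UploadValues(
                    "https://example.com/target_page.html", "POST", form);
                Console.WriteLine(Encoding.UTF8.GetString(response));
            }
        }
    }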

It's not clear from your question which language/framework you're going to use. For example, there is a screen-scraping framework (including login functionality) written in Perl: WWW::Mechanize.

Note that you may face some problems if the site you're trying to log in to uses JavaScript or some kind of CAPTCHA.

不再让梦枯萎 2024-07-12 16:17:31

It isn't clear from your question which WebClient class (or language) you are referring to.

If you have a Java runtime you can use the Apache HttpClient class; here's an example I wrote using Groovy that accesses the del.icio.us API over SSL:

   // Apache Commons HttpClient 3.x imports (needed to run this standalone)
   import org.apache.commons.httpclient.HttpClient
   import org.apache.commons.httpclient.UsernamePasswordCredentials
   import org.apache.commons.httpclient.auth.AuthScope
   import org.apache.commons.httpclient.methods.PostMethod

   def client = new HttpClient()

   def credentials = new UsernamePasswordCredentials( "username", "password" )
   def authScope = new AuthScope( "api.del.icio.us", 443, AuthScope.ANY_REALM )
   client.getState().setCredentials( authScope, credentials )

   def url = "https://api.del.icio.us/v1/posts/get"

   def tag = "groovy"   // example tag to query for

   def method = new PostMethod( url )
   method.addParameter( "tag", tag )
   client.executeMethod( method )
   println method.getResponseBodyAsString()   // dump the XML response