Scraping data from a secure website or automating a daily task

Posted on 2024-10-19 18:02:36

I have a website where I need to log in with a username, password, and captcha.

Once in, I have a control panel that has bookings. For each booking there is a link to a details page that contains the email address of the person who made the booking.

Each day I need a list of all these email addresses so I can send them an email.

I know how to scrape sites in .NET to get these types of details but not for websites where I need to be logged in.

I've seen an article where I can pass the cookie as a header, and that should do the trick, but that would require me to view the cookie in Firebug and copy and paste it over.

This would be used by a non-technical person, so that's not really the best option.

The other thing I was thinking of is a script they can run that automates this in the browser. Any tips on how to do this?


3 Answers

苯莒 2024-10-26 18:02:36

There's something you should know whether you're querying the web through HtmlAgilityPack or using the HttpWebRequest class directly (HtmlAgilityPack uses it underneath): how to handle cookies.

Here are the basic steps you should follow:

  • Load the page you want to log in to.
  • Submit the required info to log in using the POST method (username, password, or whatever the page requests).
  • Save the cookies from the response, and use those cookies from now on.
  • Request the page with those cookies and parse it with HtmlAgilityPack.

Here's something I always do when using HtmlAgilityPack: send the request to the website using HttpWebRequest instead of using the Load(..) method of the HtmlWeb class.

Keep in mind that one of the overloads of the Load method in the HtmlDocument class takes a Stream. All you have to do is pass the response stream (obtained via request.GetResponseStream()) and you will have the HtmlDocument object you need.
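For example, a minimal sketch of that approach (the URL is just a placeholder):

```csharp
using System.Net;
using HtmlAgilityPack;

// Placeholder URL -- substitute the page you actually want to parse.
var request = (HttpWebRequest)WebRequest.Create("https://example.com/bookings");

using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
{
    var doc = new HtmlDocument();
    doc.Load(stream);   // HtmlDocument.Load has an overload that accepts a Stream
    // doc.DocumentNode can now be queried with XPath.
}
```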

I suggest you install Fiddler. It is a really great tool for inspecting HTTP requests/responses, whether from your browser or from your application.

Run Fiddler, try to log on to the site through the browser, and look at what the browser sends to the page and what the page returns; that is exactly what you need to emulate using the HttpWebRequest class.

Edit:

The idea isn't just to pass a static cookie in the header. It must be the cookie returned by the page after you log in.

To handle cookies, take a look at the HttpWebRequest.CookieContainer property. It's easier than you think. All you need to do is declare an (empty) CookieContainer variable and assign it to that property before sending any request to the website. When the website responds, the cookies should be added to that container automatically, so you will be able to use them the next time you request the website.
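A rough sketch of that flow, assuming hypothetical URLs and form field names (take the real ones from Fiddler):

```csharp
using System.IO;
using System.Net;
using System.Text;

var cookies = new CookieContainer();

// 1) POST the login form. The URL and field names below are assumptions.
var loginRequest = (HttpWebRequest)WebRequest.Create("https://example.com/login");
loginRequest.Method = "POST";
loginRequest.ContentType = "application/x-www-form-urlencoded";
loginRequest.CookieContainer = cookies;            // response cookies land in here

byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");
loginRequest.ContentLength = body.Length;
using (var requestStream = loginRequest.GetRequestStream())
    requestStream.Write(body, 0, body.Length);

using (loginRequest.GetResponse()) { }             // the session cookie is now stored

// 2) Request a protected page with the same container.
var pageRequest = (HttpWebRequest)WebRequest.Create("https://example.com/booking/123");
pageRequest.CookieContainer = cookies;             // send the saved cookies back

using (var response = (HttpWebResponse)pageRequest.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();              // parse this with HtmlAgilityPack
}
```

Keep in mind that the captcha on your login form will still get in the way of a fully automated POST; it has to be dealt with separately.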

Edit 2:

If all you need is a script to automate it through the browser, take a look at the WatiN library. I'm sure you will be able to run it by yourself after you see one or two examples of how to use it ;-)
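For instance, a sketch along these lines (the URL, field names, and button text are made up, and the captcha still has to be typed by a person):

```csharp
using System;
using WatiN.Core;

// WatiN drives a real browser window (Internet Explorer in this example).
using (var browser = new IE("https://example.com/login"))           // placeholder URL
{
    browser.TextField(Find.ByName("username")).TypeText("me");      // assumed field name
    browser.TextField(Find.ByName("password")).TypeText("secret");  // assumed field name

    // Let the (non-technical) user type the captcha in the browser window,
    // then press Enter in the console to continue.
    Console.ReadLine();

    browser.Button(Find.ByValue("Log in")).Click();                 // assumed button text
    browser.GoTo("https://example.com/bookings");

    string html = browser.Html;   // page HTML, ready to hand to HtmlAgilityPack
}
```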

你是暖光i 2024-10-26 18:02:36

To scrape a web site in .NET, there is the Html Agility Pack.

And here is a link that explains how to log in with it: Using HtmlAgilityPack to GET and POST web forms
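For the parsing side, something along these lines works once you have the right HTML (the URL and the mailto: layout are just illustrative assumptions; for the logged-in request itself, see the cookie handling in the answer above):

```csharp
using System;
using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("https://example.com/booking/123");   // placeholder URL

// Assume the details page exposes the email address as a mailto: link.
var links = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'mailto:')]");
if (links != null)
{
    foreach (var link in links)
    {
        string email = link.GetAttributeValue("href", "").Substring("mailto:".Length);
        Console.WriteLine(email);
    }
}
```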

三寸金莲 2024-10-26 18:02:36

For automating screen scraping, Selenium is a good tool. There are two things: 1) install the Selenium IDE (works only in Firefox), and 2) install the Selenium RC server.

After starting the Selenium IDE, go to the site you are trying to automate and start recording the actions you perform on it. Think of it as recording a macro in the browser. Afterwards, you can export the recording as code in the language you want.
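For example, the exported C# for a recorded login might look roughly like this, run against the Selenium RC server (the locators and URLs are placeholders):

```csharp
using Selenium;   // the Selenium RC .NET client

// Connect to the Selenium RC server running locally on the default port.
ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "https://example.com/");
selenium.Start();

selenium.Open("/login");
selenium.Type("name=username", "me");          // assumed locators
selenium.Type("name=password", "secret");
selenium.Click("css=input[type=submit]");
selenium.WaitForPageToLoad("30000");

selenium.Open("/bookings");
string pageSource = selenium.GetHtmlSource();  // hand this to your parser

selenium.Stop();
```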

Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.

I've uploaded a PPT that I made a while back; it should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html

In the link above, select the regular download option.

I spent a good amount of time figuring it out, so I thought this might save somebody else some time.
