Scraping data from a secure website or automating a daily task
I have a website where I need to log in with a username, password, and captcha.
Once in, I have a control panel that lists bookings. For each booking there is a link to a details page that has the email address of the person who made the booking.
Each day I need a list of all these email addresses so I can send them an email.
I know how to scrape sites in .NET to get these kinds of details, but not for websites where I need to be logged in.
I've seen an article saying I can pass the cookie as a header, and that should do the trick, but that would require me to view the cookie in Firebug and copy and paste it over.
This would be used by a non-technical person, so that's not really the best option.
The other thing I was thinking of is a script they can run that automates this in the browser. Any tips on how to do this?
There's something you should know, no matter whether you're querying the web through `HtmlAgilityPack` or using the `HttpWebRequest` class directly (`HtmlAgilityPack` uses it under the hood): how to handle cookies. Here are, basically, the steps you should follow: request the page, read the response, and parse it with `HtmlAgilityPack`.
Here's something I always do when using `HtmlAgilityPack`: send the request to the website using `HttpWebRequest` instead of using the `Load(..)` method of the `HtmlWeb` class.
Take into account that one of the overloads of the `Load` method in the `HtmlDocument` class receives a `Stream`. All you have to do is pass the response stream (obtained via `request.GetResponseStream()`) and you will have the `HtmlDocument` object you need.
I suggest you install Fiddler. It is a really great tool to inspect HTTP requests/responses, either from your browser or from your application.
Run Fiddler, try to log on to the site through the browser, and look at what the browser sends to the page and what the page returns. That is exactly what you need to emulate using the `HttpWebRequest` class.
Edit:
The idea isn't just to pass a static cookie in the header. It must be the cookie returned by the page after you log in.
To handle cookies, take a look at the `HttpWebRequest.CookieContainer` property. It's easier than you think: all you need to do is declare an empty `CookieContainer` variable and assign it to that property before sending any request to the website. When the website responds, its cookies are added to that container automatically, so you will be able to use them the next time you request the website.
Edit 2:
If all you need is a script to automate it through your browser, take a look at the WatiN library. I'm sure you will be able to run it by yourself after you see one or two examples of how to use it ;-)
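The whole flow described above can be sketched in a few lines. This is a hypothetical example, not the poster's actual site: the URLs, form field names, and XPath expression are placeholders you would replace with the real values captured in Fiddler.

```csharp
// Sketch: log in with HttpWebRequest + a shared CookieContainer,
// then parse a protected page with HtmlAgilityPack.
// All URLs, field names, and selectors below are assumptions.
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class BookingScraper
{
    static void Main()
    {
        // One container shared by every request; the session cookie
        // from the login response lands here automatically.
        var cookies = new CookieContainer();

        // 1) POST the login form (field names are placeholders).
        var login = (HttpWebRequest)WebRequest.Create("https://example.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;
        byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");
        using (Stream s = login.GetRequestStream())
            s.Write(body, 0, body.Length);
        login.GetResponse().Close(); // cookies are now in the container

        // 2) GET a protected page with the same container.
        var page = (HttpWebRequest)WebRequest.Create("https://example.com/bookings/1");
        page.CookieContainer = cookies;
        using (WebResponse response = page.GetResponse())
        {
            // 3) Feed the response stream straight into HtmlDocument.Load.
            var doc = new HtmlDocument();
            doc.Load(response.GetResponseStream());
            HtmlNode node = doc.DocumentNode.SelectSingleNode("//a[@class='email']");
            if (node != null)
                Console.WriteLine(node.InnerText);
        }
    }
}
```

Note how step 3 passes the stream directly to `HtmlDocument.Load`, as the answer suggests, instead of going through `HtmlWeb.Load` (which would issue its own request without your cookie container).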
To scrape a web site in .NET, there is the Html Agility Pack.
And here is a link that explains how to log in with it: Using HtmlAgilityPack to GET and POST web forms
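The GET half of that approach matters because login forms often carry hidden fields (e.g. ASP.NET's `__VIEWSTATE`) that must be echoed back in the POST. A small sketch, with a placeholder URL, of reading a form's inputs before posting it:

```csharp
// Sketch: fetch a login page with HtmlAgilityPack's HtmlWeb and list
// the form's input fields (name -> value), including hidden ones that
// the subsequent POST must include. The URL is a placeholder.
using System;
using HtmlAgilityPack;

class FormFields
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/login");

        var inputs = doc.DocumentNode.SelectNodes("//form//input");
        if (inputs == null)
            return; // no form found on the page

        foreach (HtmlNode input in inputs)
        {
            string name = input.GetAttributeValue("name", "");
            string value = input.GetAttributeValue("value", "");
            Console.WriteLine("{0} = {1}", name, value);
        }
    }
}
```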
For automating screen scraping, Selenium is a good tool. There are two things: 1) install the Selenium IDE (works only in Firefox), and 2) install the Selenium RC server.
After starting the Selenium IDE, go to the site you are trying to automate and start recording the events you perform on the site. Think of it as recording a macro in the browser. Afterwards, you get the code output in the language you want.
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a ppt that I made a while back. This should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
In the above link, select the regular download option.
I spent a good amount of time figuring it out, so I thought this might save somebody some time.
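A recorded session exported to C# looks roughly like the sketch below (written against the WebDriver API from the Selenium .NET bindings; the answer above describes the older IDE/RC workflow, and the URL and element names here are placeholders). Driving a real browser also sidesteps the captcha problem: the non-technical user can solve it by hand while the script waits.

```csharp
// Sketch of a browser-automation script with Selenium's .NET bindings.
// URLs and element locators are assumptions; a human solves the captcha.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class BookingBot
{
    static void Main()
    {
        IWebDriver driver = new FirefoxDriver();
        driver.Navigate().GoToUrl("https://example.com/login");

        driver.FindElement(By.Name("username")).SendKeys("me");
        driver.FindElement(By.Name("password")).SendKeys("secret");

        // Pause so the user can type the captcha in the real browser,
        // then press Enter in the console to continue.
        Console.WriteLine("Solve the captcha in the browser, then press Enter...");
        Console.ReadLine();

        driver.FindElement(By.Id("loginButton")).Click();

        // Collect the links to each booking's details page.
        foreach (IWebElement link in driver.FindElements(By.CssSelector("a.booking-details")))
            Console.WriteLine(link.GetAttribute("href"));

        driver.Quit();
    }
}
```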