How can I scrape data from a website protected by Shibboleth?
I am trying to scrape data from one of my university's websites, which uses Shibboleth for authentication. However, I am having difficulty working out the best way to get past the login and reach the page I want to scrape. I have valid credentials that I can log in with. Does anyone have suggestions for how to accomplish this?
Comments (5)
I have been scripting Shibbolized logins successfully (in my case, to monitor the health of both the Shibboleth IdP and the applications it protects).
I am using Python's urllib module and its classes to handle redirect following and cookie passing (both needed for Shibboleth) and to post the login form. After a little tinkering, urllib gets you most of the way through a Shibbolized login. You could use this approach to handle the initial login to the Shibbolized website and then do the scraping with straightforward use of Python's urllib.
Example Python script for logging into Shibboleth
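The urllib approach described in this answer can be sketched roughly as follows. This is not the script the answer refers to, only a minimal illustration under stated assumptions: the helper names (`FormParser`, `parse_form`, `submit_form`, `shibboleth_login`) are hypothetical, the login form is assumed to use the field names `j_username`, `j_password` and `_eventId_proceed`, and the IdP is assumed to answer with a plain HTML form carrying `SAMLResponse`. Real deployments may add extra pages or use different field names, so inspect the actual login form first.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


class FormParser(HTMLParser):
    """Collect the action URL and named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Unchecked checkboxes and radio buttons are not submitted by browsers.
            if attrs.get("type") in ("checkbox", "radio") and "checked" not in attrs:
                return
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def parse_form(html, base_url):
    """Return (absolute action URL, field dict) for the first form in html."""
    parser = FormParser()
    parser.feed(html)
    return urllib.parse.urljoin(base_url, parser.action or ""), parser.fields


def submit_form(opener, page_url, html, extra=None):
    """POST the first form on the page, optionally overriding or adding fields."""
    action, fields = parse_form(html, page_url)
    fields.update(extra or {})
    data = urllib.parse.urlencode(fields).encode("utf-8")
    response = opener.open(action, data)
    return response.geturl(), response.read().decode("utf-8", errors="replace")


def shibboleth_login(protected_url, username, password):
    """Log in and return an opener whose cookie jar holds the session cookies."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    # 1. Requesting the protected page follows redirects to the IdP login form.
    response = opener.open(protected_url)
    url = response.geturl()
    html = response.read().decode("utf-8", errors="replace")
    # 2. Post the credentials (field names are assumptions, see above).
    credentials = {
        "j_username": username,
        "j_password": password,
        "_eventId_proceed": "",
    }
    url, html = submit_form(opener, url, html, credentials)
    # 3. Browsers auto-submit the form carrying SAMLResponse via JavaScript;
    #    here it has to be posted back to the protected site by hand.
    _, fields = parse_form(html, url)
    if "SAMLResponse" in fields:
        url, html = submit_form(opener, url, html)
    return opener, html
```

After `opener, html = shibboleth_login(url, user, password)` succeeds, further pages can be fetched with `opener.open(...)`, since the same cookie jar is reused for every request.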
You could use Mechanize to submit the forms and log in to the website: http://wwwsearch.sourceforge.net/mechanize/
I believe the ECP profile was designed for accessing Shibboleth-protected resources from non-browser clients (e.g. the command line).
Try one of the sample clients available on the Shibboleth wiki.
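As a rough illustration of how an ECP exchange begins, the sketch below builds the first request only: the client tells the service provider that it speaks PAOS, and an ECP-enabled service provider is expected to answer with a SOAP envelope instead of a browser redirect. The header values follow the SAML 2.0 ECP profile; the function names and the example URL are hypothetical. The remaining steps (forwarding the request to the IdP's ECP endpoint and posting its answer back to the service provider) are what the sample clients mentioned above implement, and ECP must be enabled on both sides for any of this to work.

```python
import urllib.request

# Headers by which a client advertises ECP/PAOS support to the service provider.
PAOS_HEADERS = {
    "Accept": "text/html; application/vnd.paos+xml",
    "PAOS": 'ver="urn:liberty:paos:2003-08";'
            '"urn:oasis:names:tc:SAML:2.0:profiles:SSO:ecp"',
}


def build_ecp_request(resource_url):
    """Build the initial GET for a protected resource with PAOS headers set."""
    return urllib.request.Request(resource_url, headers=PAOS_HEADERS)


def is_paos_response(content_type):
    """Return True if a Content-Type header value denotes a PAOS envelope."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "application/vnd.paos+xml"
```

A client would send `build_ecp_request("https://sp.example.edu/secure/")` through an opener with cookie handling and check `is_paos_response(response.headers.get("Content-Type", ""))`; an HTML login page in the reply suggests that ECP is not enabled for that resource.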
You can also try Apache JMeter: record your actions, add some scripting (which is not so easy where Shibboleth is concerned), and you can then access these pages automatically.
[Edit - better solution]
I believe the Shibboleth documentation pages include scripts for Grinder (another load-testing tool). Those test plans are in fact Python (OK, Jython) scripts, which should be quite easy to modify and use for your purposes.
Very late reply, but you could use Facebook Webdriver to log in and then scrape once you're authenticated.