How do I allow a crawler to access a closed (private) wiki?
I need to provide access to a private wiki to a crawler.
The wiki is closed to all anonymous users: you have to log in to see the contents. However, I need to give a single crawler (identified by a user-agent string and a single IP) full access so the contents can be indexed. It's an internal crawler, so access to its resources will only be available after a successful login.
Any suggestions on how to enable access for a single client (and not a user, since a crawler is not able to log itself into the wiki)?
Comments (3)
There actually is a solution to this problem.
As I mentioned, the crawler will be using a specific IP, and nothing but the crawler will use that IP. So a quick and dirty, but still civilised, way to do it is:
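Something along these lines in LocalSettings.php. This is only a minimal sketch: the IP address and user-agent substring are placeholders, the variable names are made up for illustration, and it assumes the web server sees the crawler's real address in `REMOTE_ADDR` (behind a reverse proxy it would see the proxy's address instead).

```php
# LocalSettings.php (sketch; the values below are placeholders, not real settings)

// The wiki stays closed to anonymous visitors by default.
$wgGroupPermissions['*']['read'] = false;

$crawlerIp    = '192.0.2.10';       // hypothetical: the crawler's fixed IP
$crawlerAgent = 'InternalCrawler';  // hypothetical: substring of its user-agent string

$remoteIp  = $_SERVER['REMOTE_ADDR'] ?? '';
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if ( $remoteIp === $crawlerIp && strpos( $userAgent, $crawlerAgent ) !== false ) {
    // Only requests coming from the crawler's IP with its user-agent
    // are allowed to read pages without logging in.
    $wgGroupPermissions['*']['read'] = true;
}
```

The user-agent check is easy to spoof, so the IP comparison is what actually restricts access here; the second condition just narrows it to the crawler client on that machine.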
Simple, huh? :)
If you have access to the database, you can use a database crawler in a system like Solr to do this for you.
You can make a custom user group for your crawler; let's say we call it 'crawler'. Since it has to log in anyway, that'd be the easiest solution.
Just give it read permissions like this:
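A minimal sketch of what that could look like in LocalSettings.php, using the `$wgGroupPermissions` setting described on the manual page referenced below. The group name 'crawler' is just the example name from above, and the first line assumes the wiki is closed to anonymous readers, as the question describes:

```php
# LocalSettings.php (sketch)
$wgGroupPermissions['*']['read']       = false; // anonymous visitors cannot read pages
$wgGroupPermissions['crawler']['read'] = true;  // members of the 'crawler' group can
```

The crawler's account would then be added to that group via Special:UserRights.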
Reference: http://www.mediawiki.org/wiki/Manual:User_rights#Changing_group_permissions
Edit: Hmm, wait, I misread. The crawler is probably not a logged-in account, right? Hold on, checking whether you can set permissions for an IP.