Preventing rogue spiders from indexing a directory
We have a secure website (developed in .NET 2.0/C#, running on a Windows server with IIS 5) to which members have to log in before they can view some PDF files stored in a virtual directory. To prevent spiders from crawling this website, we have a robots.txt that disallows all user agents. However, this will NOT prevent rogue spiders from indexing the PDF files, since they disregard robots.txt. Since the documents are supposed to be secure, I do not want ANY spiders getting into this virtual directory (not even the good ones).
I've read a few articles on the web and am wondering how programmers (rather than web masters) have solved this problem in their applications, since it seems like a very common one. There are many options on the web, but I am looking for something easy and elegant.
Some options I have seen, but they seem weak. Listed here with their cons:
Creating a honeypot/tarpit that lets rogue spiders in and then lists their IP addresses. Cons: this can also block valid users coming from the same IP, and the list has to be maintained manually or some way provided for members to remove themselves from it. We don't have a range of IPs that valid members will use, since the website is on the internet.
Request header analysis: however, rogue spiders use real agent names, so this is pointless.
Meta robots tag: cons: only obeyed by Google and other valid spiders.
There was some talk about using .htaccess, which is supposed to be good, but that only works with Apache, not IIS.
Any suggestions are very much appreciated.
EDIT: as 9000 pointed out below, rogue spiders should not be able to get into a page that requires a login. I guess the question is 'how to prevent someone who knows the link from requesting the PDF file without logging into the website'.
3 Answers
I see a contradiction between the statement that members have to log in before they can view the PDF files and the statement that rogue spiders can nevertheless index those files. How come any unauthorized HTTP request to this directory ever gets served with anything other than a 401? The rogue spiders certainly can't provide an authorization cookie. And if the directory is accessible to them, what is 'member login' then?
Probably you need to serve the PDF files via a script that checks authorization. I think IIS is also capable of requiring authorization just for access to a directory (but I don't really know).
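For what it's worth, here is a minimal sketch of the script-based approach under .NET 2.0. The session flag "LoggedIn", the folder ~/SecurePdfs and the file name Download.ashx are all made-up names for illustration, not anything from the question:

    <%@ WebHandler Language="C#" Class="SecureDownload" %>

    using System;
    using System.IO;
    using System.Web;
    using System.Web.SessionState;

    // Download.ashx: serves a PDF only when the session says the user is logged in.
    // IReadOnlySessionState is required so the handler can read the session at all.
    public class SecureDownload : IHttpHandler, IReadOnlySessionState
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            // No login marker in the session: answer with 401 instead of the file.
            if (context.Session == null || context.Session["LoggedIn"] == null)
            {
                context.Response.StatusCode = 401;
                return;
            }

            // Accept only a bare file name so "..\" tricks can't escape the folder.
            string name = Path.GetFileName(context.Request.QueryString["file"] ?? string.Empty);
            string path = Path.Combine(context.Server.MapPath("~/SecurePdfs"), name);
            if (name.Length == 0 || !File.Exists(path))
            {
                context.Response.StatusCode = 404;
                return;
            }

            context.Response.ContentType = "application/pdf";
            context.Response.TransmitFile(path);
        }
    }

Pages would then link to Download.ashx?file=whatever.pdf rather than to the PDF itself.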
I assume that your links to the PDFs come from a known location. You can check Request.UrlReferrer to make sure users are coming from this internal/known page to access the PDFs. I would definitely force downloads to go through a script where you can check that a user is in fact logged in to the site before allowing the download.
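Something along these lines, perhaps (a sketch only, assuming a session flag named "LoggedIn", a hypothetical internal listing page Documents.aspx and a protected folder ~/SecurePdfs):

    using System;
    using System.IO;
    using System.Web.UI;

    // GetPdf.aspx.cs (hypothetical): refuse the download unless the user is logged in
    // AND arrived from the known internal page.
    public partial class GetPdf : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            bool loggedIn = Session["LoggedIn"] != null;        // flag set by your login code
            Uri referrer = Request.UrlReferrer;                 // null when the header is absent
            bool fromKnownPage = referrer != null &&
                referrer.AbsolutePath.EndsWith("/Documents.aspx", StringComparison.OrdinalIgnoreCase);

            if (!loggedIn || !fromKnownPage)
            {
                Response.StatusCode = 403;                      // refuse direct or external requests
                return;
            }

            // Otherwise stream the file; Path.GetFileName keeps the request inside the folder.
            string name = Path.GetFileName(Request.QueryString["file"] ?? string.Empty);
            Response.ContentType = "application/pdf";
            Response.TransmitFile(Server.MapPath("~/SecurePdfs/" + name));
        }
    }

Keep in mind that the Referer header can be spoofed or stripped by proxies, so the referrer test is best treated as a supplement to the login check rather than a control on its own.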
Untested, but this should give you an idea at least.
I'd also stay away from robots.txt since people will often use this to actually look for things you think you're hiding.
Here is what I did (expanding on Leigh's code).
I created an HttpHandler for PDF files, created a web.config in the secure directory, and configured the handler to handle PDFs.
In the handler, I check whether the user is logged in, using a session variable set by the application.
If the user has the session variable, I create a FileInfo object and send it in the response. Note: don't call 'context.Response.End()'; also, the 'Content-Disposition' header is obsolete.
So now, whenever there is a request for a PDF in the secure directory, the HTTP handler gets the request and checks whether the user is logged in. If not, it displays an error message; otherwise it serves the file.
Not sure if there is a performance hit, since I am creating a FileInfo object and sending that rather than serving the file that already exists on disk. The thing is that you can't Server.Transfer or Response.Redirect to the *.pdf file, because that creates an infinite loop and the response never gets returned to the user.
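A stripped-down sketch of the shape described above; the namespace, assembly name and session key are placeholders rather than the real ones:

    using System.IO;
    using System.Web;
    using System.Web.SessionState;

    /*
      web.config placed in the secure directory (classic IIS 5/6 ASP.NET pipeline),
      assuming the handler is compiled into an assembly called SecureSite:

      <configuration>
        <system.web>
          <httpHandlers>
            <add verb="*" path="*.pdf" type="SecureSite.PdfHandler, SecureSite" />
          </httpHandlers>
        </system.web>
      </configuration>
    */
    namespace SecureSite
    {
        public class PdfHandler : IHttpHandler, IReadOnlySessionState
        {
            public bool IsReusable { get { return true; } }

            public void ProcessRequest(HttpContext context)
            {
                // Not logged in: show an error message instead of the document.
                if (context.Session == null || context.Session["LoggedIn"] == null)
                {
                    context.Response.StatusCode = 403;
                    context.Response.Write("Please log in to view this document.");
                    return;
                }

                // Write the file out directly. Do NOT Server.Transfer/Redirect back to the
                // .pdf URL, because that re-enters this handler and loops forever.
                FileInfo file = new FileInfo(context.Request.PhysicalPath);
                if (!file.Exists)
                {
                    context.Response.StatusCode = 404;
                    return;
                }

                context.Response.ContentType = "application/pdf";
                context.Response.WriteFile(file.FullName);
                // As noted above, no context.Response.End() here.
            }
        }
    }

One thing to watch on IIS 5: the .pdf extension also has to be mapped to aspnet_isapi.dll in the site's script mappings, otherwise IIS serves the file as static content and the handler never runs.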