Preventing your site's data from being scraped and stolen
I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search.
What are the measures I can take to prevent malicious crawlers from ripping off all the data from my site? I'm less worried about SEO, although I wouldn't want to block legitimate crawlers altogether.
For example, I thought about randomly changing small bits of the HTML structure used to display my data, but I guess it wouldn't really be effective.
12 Answers
Any site that is visible to human eyes is, in theory, potentially rippable. If you're going to even try to be accessible, then this, by definition, must be the case (how else will speaking browsers be able to deliver your content if it isn't machine readable?).
Your best bet is to look into watermarking your content, so that at least if it does get ripped you can point to the watermarks and claim ownership.
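One lightweight way to watermark HTML content is to embed a per-request marker that readers never notice but that a ripped copy of the page source carries along. A minimal sketch in Python (the scheme, the names and the comment format are assumptions for illustration, not a standard):

    import hashlib

    SECRET = "replace-with-a-private-secret"   # keep this out of the page itself

    def watermark_comment(entry_id, requester_ip):
        # Derive a short token that ties this served copy of the entry to the request.
        token = hashlib.sha256(f"{SECRET}:{entry_id}:{requester_ip}".encode()).hexdigest()[:12]
        # Embedded as an HTML comment: invisible to readers, but it travels with
        # a naive copy of the page source and can be looked up in your logs later.
        return f"<!-- wm:{token} -->"

Logging (entry_id, requester_ip, token) for each response lets you trace a copy found elsewhere back to the request that fetched it.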
Between this:
"What are the measures I can take to prevent malicious crawlers from ripping off all the data from my site?"
and this:
"I wouldn't want to block legitimate crawlers altogether."
you're asking for a lot. Fact is, if you're going to try and block malicious scrapers, you're going to end up blocking all the "good" crawlers too.
You have to remember that if people want to scrape your content, they're going to put in a lot more manual effort than a search engine bot will... So get your priorities right. You've two choices:
Good crawlers will follow the rules you specify in your robots.txt, malicious ones will not.
You can set up a "trap" for bad robots, as explained here:
http://www.fleiner.com/bots/.
But then again, if you put your content on the internet, I think it's better for everyone if it's as painless as possible to find (in fact, you're posting here and not at some lame forum where experts exchange their opinions)
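The trap on that page boils down to: disallow a path in robots.txt that no human would ever reach, link to it invisibly, and ban whatever requests it anyway. A rough sketch of the idea, assuming a Flask app (the path name and the in-memory ban list are placeholders):

    # robots.txt served by the site:
    #   User-agent: *
    #   Disallow: /bot-trap/
    # Good crawlers never request /bot-trap/; anything that does has ignored the rules.

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()          # in practice you would persist this somewhere

    @app.before_request
    def reject_banned():
        if request.remote_addr in banned_ips:
            abort(403)

    @app.route("/bot-trap/")
    def bot_trap():
        # Reached only via an invisible link that humans never see or click,
        # so whoever lands here ignored robots.txt: remember the IP and refuse.
        banned_ips.add(request.remote_addr)
        abort(403)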
Realistically, you can't stop malicious crawlers - and any measures you put in place to prevent them are likely to harm your legitimate users (aside from perhaps adding entries to robots.txt to allow detection).
So what you have to do is plan on the content being stolen - it's more than likely to happen in one form or another - and understand how you will deal with unauthorized copying.
Prevention isn't possible - and trying to make it so will be a waste of your time.
The only sure way of making sure that the content on a website isn't vulnerable to copying is to unplug the network cable...
To detect copying, something like http://www.copyscape.com/ may help.
Don't even try to erect limits on the web!
It really is as simple as this.
Every potential measure to discourage ripping (aside from a very strict robots.txt) will harm your users. Captchas are more pain than gain. Checking the user agent shuts out unexpected browsers. The same is true for "clever" tricks with JavaScript.
Please keep the web open. If you don't want anything to be taken from your website, then do not publish it there. Watermarks can help you claim ownership, but that only helps when you want to sue after the harm is done.
The only way to stop a site being machine ripped is to make the user prove that they are human.
You could make users perform a task that is easy for humans and hard for machines, e.g. a CAPTCHA. When a user first gets to your site, present a CAPTCHA and only allow them to proceed once it has been completed. If the user starts moving from page to page too quickly, re-verify.
This is not 100% effective and hackers are always trying to break them.
Alternatively, you could slow down your responses. You don't need to make them crawl, but pick a speed that is reasonable for humans (it would be very slow for a machine). This just makes it take longer to scrape your site, but not impossible.
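The "too quickly" check might look something like this - a minimal sketch assuming you already track sessions and have a CAPTCHA page to send the user back to (the threshold is an arbitrary guess at human speed):

    import time

    PAGE_VIEWS_PER_MINUTE = 30     # assumed limit for "faster than a human"
    recent_hits = {}               # session_id -> timestamps of recent page views

    def needs_reverification(session_id):
        # Keep only the views from the last 60 seconds and compare to the limit.
        now = time.time()
        hits = [t for t in recent_hits.get(session_id, []) if now - t < 60]
        hits.append(now)
        recent_hits[session_id] = hits
        return len(hits) > PAGE_VIEWS_PER_MINUTE

    # In the page handler: if needs_reverification(session_id) is True,
    # redirect to the CAPTCHA before serving the next entry.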
OK. Out of ideas.
In short: you cannot prevent ripping. Malicious bots commonly use IE user agents and are fairly intelligent nowadays. If you want your site to be accessible to the maximum number of people (i.e. screen readers, etc.), you cannot rely on JavaScript or one of the popular plugins (Flash), simply because they can inhibit a legitimate user's access.
Perhaps you could have a cron job that picks a random snippet out of your database and googles it to check for matches. You could then try to get hold of the offending site and demand they take the content down.
You could also monitor the number of requests from a given IP and block it if it passes a threshold, although you may have to whitelist legitimate bots, and it would be no use against a botnet (but if you are up against a botnet, perhaps ripping is not your biggest problem).
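The per-IP threshold with a whitelist could be as simple as the following sketch (the limit, the reset strategy and the whitelisted address are placeholders; as said above, it does nothing against a botnet that spreads requests over many IPs):

    from collections import defaultdict

    REQUESTS_PER_HOUR_LIMIT = 1000       # assumed threshold
    WHITELISTED_IPS = {"203.0.113.7"}    # placeholder: addresses of crawlers you trust

    request_counts = defaultdict(int)    # reset this (e.g. from a cron job) every hour

    def should_block(ip):
        # Never block whitelisted crawlers; otherwise count requests and block
        # any single IP that goes over the hourly limit.
        if ip in WHITELISTED_IPS:
            return False
        request_counts[ip] += 1
        return request_counts[ip] > REQUESTS_PER_HOUR_LIMIT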
If you're making a public site, then it's very difficult. There are methods that involve server-side scripting to generate content or the use of non-text (Flash, etc) to minimize the likelihood of ripping.
But to be honest, if you consider your content to be so good, just password-protect it and remove it from the public arena.
My opinion is that the whole point of the web is to propagate useful content to as many people as possible.
If the content is public and freely available, even with page view throttling or whatever, there is nothing you can do. If you require registration and/or payment to access the data, you might restrict it a bit, and at least you can see who reads what and identify the users that seem to be scraping your entire database.
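For the "see who reads what" part, a sketch of how logged views could be scanned for accounts that walk through an unusually large share of the database (the log format and the threshold are assumptions):

    from collections import defaultdict

    def flag_bulk_readers(access_log, total_entries, share_threshold=0.5):
        # access_log: iterable of (user_id, entry_id) view events.
        # Flags accounts that have viewed more than `share_threshold` of all
        # entries, which rarely happens through normal browsing or searching.
        seen = defaultdict(set)
        for user_id, entry_id in access_log:
            seen[user_id].add(entry_id)
        return [user for user, entries in seen.items()
                if len(entries) > share_threshold * total_entries]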
However, I think you should rather face the fact that this is how the net works; there are not many ways to prevent a machine from reading what a human can. Outputting all your content as images would of course discourage most, but then the site is not accessible anymore, not to mention that even non-disabled users won't be able to copy-paste anything - which can be really annoying.
All in all this sounds like DRM/game protection systems - annoying the hell out of your legitimate users only to prevent some bad behavior that you can't really prevent anyway.
You could try using Flash / Silverlight / Java to display all your page contents. That would probably stop most crawlers in their tracks.
I used to have a system that would block or allow based on the User-Agent header.
It relies on the crawler setting its User-Agent, but it seems most of them do.
It won't work if they use a fake header to emulate a popular browser of course.
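A check along those lines might look like the following sketch (the marker list is illustrative only):

    BLOCKED_AGENT_MARKERS = ("wget", "curl", "httrack", "webzip")   # illustrative list

    def is_blocked_user_agent(user_agent):
        # Reject requests whose User-Agent matches a known ripping tool. Anything
        # that spoofs a popular browser string will, of course, get through.
        ua = (user_agent or "").lower()
        return any(marker in ua for marker in BLOCKED_AGENT_MARKERS)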
Use human validators wherever possible and try using some framework (MVC). Site-ripping software is sometimes unable to rip this kind of page. Also detect the user agent; at the very least it will reduce the number of possible rippers.