Protecting website content from crawlers
The content of a commerce website (ASP.NET MVC) is regularly crawled by competitors. These people are programmers and use sophisticated methods to crawl the site, so identifying them by IP address is not possible.
Unfortunately, replacing values with images is not an option because the site must remain readable by screen readers (JAWS).
My own idea is to use robots.txt: prohibit crawlers from accessing one common URL on the page. This could be disguised as a normal item detail link but hidden from normal users:
Valid URL: http://example.com?itemId=1234
Prohibited: http://example.com?itemId=123 (under 128)
If a client at a given IP address follows the prohibited link, show it a CAPTCHA.
A normal user would never follow a link like this because it is not visible, and Google has no need to crawl it because it is bogus. The problem is that a screen reader still reads the link, and I doubt this would be effective enough to be worth implementing.
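A minimal sketch of the bookkeeping behind the trap-link idea described above, written in Python for brevity rather than ASP.NET MVC. The names `HoneypotGuard`, `TRAP_THRESHOLD`, and `needs_captcha` are hypothetical, and the rule that item IDs under 128 are traps is an assumption drawn from the example URLs in the question. It does not address the screen reader concern, and the per-IP flag is only as strong as IP identification itself.

```python
from dataclasses import dataclass, field

# Assumption: item IDs below this value are never linked visibly and are
# disallowed in robots.txt, so only a misbehaving crawler requests them.
TRAP_THRESHOLD = 128


@dataclass
class HoneypotGuard:
    """Tracks client IPs that have requested a trap URL."""
    flagged_ips: set = field(default_factory=set)

    def is_trap(self, item_id: int) -> bool:
        return item_id < TRAP_THRESHOLD

    def record_request(self, ip: str, item_id: int) -> None:
        # Flag the IP the first time it follows a hidden trap link.
        if self.is_trap(item_id):
            self.flagged_ips.add(ip)

    def needs_captcha(self, ip: str) -> bool:
        # Flagged clients must pass a CAPTCHA before seeing more pages.
        return ip in self.flagged_ips


guard = HoneypotGuard()
guard.record_request("203.0.113.7", 1234)   # real item, nothing happens
guard.record_request("198.51.100.9", 123)   # trap item, IP gets flagged
```

In a real site the `record_request` call would sit wherever requests are intercepted before the item page is rendered.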
Comments (3)
Your idea might work against a few basic crawlers, but it would be very easy to work around. They would just need to use a proxy and issue a GET for each link from a new IP address.
If you allow anonymous access to your website, you can never fully protect your data. Even if you manage to block crawlers with a lot of time and effort, they could just get a human to browse the site and capture the content with something like Fiddler. The best way to keep your competitors from seeing your data is not to put it on a public part of your website.
Forcing users to log in might help; at least then you could work out who is crawling your site and ban them.
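Once requests are tied to an account, spotting a crawler can be as simple as counting page views per user. The following is a rough Python sketch under stated assumptions: `AccountRateTracker`, `WINDOW_SECONDS`, and `MAX_REQUESTS` are hypothetical names, and the thresholds are arbitrary placeholders rather than recommended values. The caller passes the current time so the logic is easy to exercise.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # assumption: length of the sliding window
MAX_REQUESTS = 120    # assumption: more page views than a human would make


class AccountRateTracker:
    """Counts page views per logged-in user over a sliding time window."""

    def __init__(self):
        self._hits = defaultdict(deque)   # user id -> request timestamps
        self.banned = set()

    def record(self, user_id: str, now: float) -> bool:
        """Record one page view; return True if the request is allowed."""
        if user_id in self.banned:
            return False
        hits = self._hits[user_id]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - WINDOW_SECONDS:
            hits.popleft()
        if len(hits) > MAX_REQUESTS:
            self.banned.add(user_id)
            return False
        return True


tracker = AccountRateTracker()
# Simulate an account requesting a page every 0.1 seconds.
results = [tracker.record("bot", t * 0.1) for t in range(121)]
```

A crawler can still register several accounts and stay under the limit on each, so this raises the cost of scraping rather than preventing it.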
As mentioned, it's not really possible to hide publicly accessible data from a determined user. However, since these are automated crawlers, you could make life harder for them by altering the layout of your pages regularly.
It is probably possible to use different master pages to produce the same (or similar) layouts, and you could swap in a master page at random. This would make writing an automated crawler that bit more difficult.
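The layout-swapping idea can be illustrated without master pages. This is a small Python sketch, not ASP.NET code: the `LAYOUTS` templates and the `render_item` function are hypothetical, and each variant is assumed to look the same to a visitor once styled. The values stay as real text, so a screen reader should still be able to read them, while a scraper keyed to one tag or class name breaks whenever a different variant is served.

```python
import random
from html import escape

# Hypothetical layout variants: same data, different tags and class names.
LAYOUTS = [
    '<div class="item"><h2>{name}</h2><span class="price">{price}</span></div>',
    '<section class="p-card"><h3>{name}</h3><p class="amt">{price}</p></section>',
    '<article><header>{name}</header><strong data-v="c">{price}</strong></article>',
]


def render_item(name: str, price: str, rng: random.Random) -> str:
    """Render one item using a randomly chosen layout variant."""
    template = rng.choice(LAYOUTS)
    return template.format(name=escape(name), price=escape(price))


print(render_item("Widget", "9.99", random.Random()))
```

A scraper that parses loosely, for example by searching the page text for a price pattern, would be unaffected, which matches the "that bit more difficult" framing above.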
I am also about to reach the stage of protecting my content from crawlers.
I am thinking of limiting what an anonymous user can see on the website and requiring them to register for full functionality.
example:
Since you now know who your users are, you can penalize any account that crawls the site.