I want to develop a very robust method to detect only a few top search engine spiders, such as Googlebot, and let them access content on my site; otherwise the usual user registration/login is required to view that content.
Note that I also make use of cookies to let users access some content without being registered. So if cookies are disabled in the client browser, no content except the front page is offered. But I have heard that search engine spiders don't accept cookies, so this would also shut out legitimate search engine bots. Is this correct?
One suggestion I heard is to do a reverse lookup on the IP address and, if it resolves to googlebot.com for example, do a forward DNS lookup; if that returns the original IP, then it is legitimate and not someone impersonating Googlebot. I am using Java on a Linux server, so I am looking for a Java-based solution.
I only want to let in the top search engine spiders (such as Google, Yahoo, Bing, Alexa, etc.) and keep the others out to reduce server load. But it is very important that the top spiders index my site.
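A minimal sketch of that reverse-then-forward DNS check in Java, using only java.net.InetAddress (the class name, method name, and allowed host suffixes below are my own choices for illustration, not part of any official API):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Sketch of the reverse-then-forward DNS check described above.
    public class BotVerifier {

        // Verify that the client IP really belongs to a host whose reverse DNS
        // name ends with one of the expected crawler domains (e.g. ".googlebot.com").
        public static boolean isVerifiedCrawler(String clientIp, String... allowedSuffixes) {
            try {
                // Reverse lookup: IP -> host name (PTR record).
                InetAddress addr = InetAddress.getByName(clientIp);
                String host = addr.getCanonicalHostName();

                // If the reverse lookup fails, getCanonicalHostName() returns the IP literal.
                if (host.equals(clientIp)) {
                    return false;
                }

                boolean suffixOk = false;
                for (String suffix : allowedSuffixes) {
                    if (host.endsWith(suffix)) {
                        suffixOk = true;
                        break;
                    }
                }
                if (!suffixOk) {
                    return false;
                }

                // Forward lookup: host name -> IPs; the original IP must be among them.
                for (InetAddress forward : InetAddress.getAllByName(host)) {
                    if (forward.getHostAddress().equals(clientIp)) {
                        return true;
                    }
                }
                return false;
            } catch (UnknownHostException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            // Example check against a hypothetical request IP claiming to be Googlebot.
            System.out.println(isVerifiedCrawler("66.249.66.1", ".googlebot.com", ".google.com"));
        }
    }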
For a more complete answer to your question, you can't rely on only one approach. The problem is the conflicting nature of what you want to do. Essentially you want to allow good bots to access your site and index it so you can appear on search engines; but you want to block bad bots from sucking up all your bandwidth and stealing your information.
First line of defense:
Create a robots.txt file at the root of your site. See http://www.robotstxt.org/ for more information about that. This will keep good, well-behaved bots in the areas of the site that make the most sense. Keep in mind that robots.txt relies on the User-Agent string if you provide different behavior for one bot vs. another. See http://www.robotstxt.org/db.html
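For illustration, a robots.txt along these lines lets a few named crawlers in while shutting out the rest (the crawler tokens are real, but which bots and paths you allow is obviously site-specific):

    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    User-agent: *
    Disallow: /

Note that an empty Disallow means "allow everything" for that crawler, and that badly behaved bots are free to ignore the file entirely, which is why the later lines of defense still matter.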
Second line of defense:
Filter on User-Agent and/or IP address. I've already been criticized for suggesting that, but it's surprising how few bots disguise who and what they are, even the bad ones. Again, it's not going to stop all bad behavior, but it provides a level of due diligence. More on leveraging User-Agent later.
Third line of defense:
Monitor your Web server's access logs. Use a log analyzer to figure out where the bulk of your traffic is coming from. These logs include both IP addresses and user-agent strings, so you can detect how many instances of a bot are hitting you, and whether it really is who it says it is: see http://www.robotstxt.org/iplookup.html
You may have to whip up your own log analyzer to find out the request rate from different clients. Anything above a certain threshold (like maybe 10/second) would be a candidate for rate limiting later on.
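As a rough illustration, a throwaway analyzer like the sketch below (my own example; it assumes the client IP is the first whitespace-separated field of each log line, as in the common/combined Apache formats, and it only counts totals rather than true requests/second) is often enough to spot the heavy hitters:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // Count requests per client IP in an access log and print the IPs
    // that exceed a threshold. Path and threshold are placeholders.
    public class LogRateCheck {
        public static void main(String[] args) throws IOException {
            String logFile = args.length > 0 ? args[0] : "access.log";
            int threshold = 1000;

            Map<String, Integer> hitsPerIp = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(logFile))) {
                String ip = line.split("\\s+", 2)[0];   // first field = client IP
                hitsPerIp.merge(ip, 1, Integer::sum);
            }

            hitsPerIp.forEach((ip, hits) -> {
                if (hits > threshold) {
                    System.out.println(ip + " made " + hits + " requests; candidate for rate limiting");
                }
            });
        }
    }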
Leveraging User Agent for Alternative Site Content:
An approach we had to take to protect our users from even legitimate bots hammering our site is to split traffic based on the User-Agent. Basically, if the User-Agent was a known browser, they got the full featured site. If it was not a known browser it was treated as a bot, and was given a set of simple HTML files with just the meta information and links they needed to do their job. The bot's HTML files were statically generated four times a day, so there was no processing overhead. You can also render RSS feeds instead of stripped down HTML which provide the same function.
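A sketch of that kind of split as a servlet filter (the class name, browser markers, and the /botpages path are all made up for illustration; note that many crawlers also put "Mozilla" in their User-Agent, so a real implementation would consult a proper user-agent database rather than naive substring checks):

    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import java.io.IOException;

    // Route requests from unknown (non-browser) user agents to statically
    // generated bot pages, and let known browsers through to the full site.
    public class BotContentFilter implements Filter {

        // Deliberately small, made-up list of browser markers; real lists are longer.
        private static final String[] BROWSER_MARKERS = {"Mozilla", "Opera"};

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;

            String userAgent = request.getHeader("User-Agent");
            boolean knownBrowser = false;
            if (userAgent != null) {
                for (String marker : BROWSER_MARKERS) {
                    if (userAgent.contains(marker)) {
                        knownBrowser = true;
                        break;
                    }
                }
            }

            if (knownBrowser) {
                chain.doFilter(req, res);   // full-featured site
            } else {
                // Hand bots the pre-generated static HTML version of the same page.
                // The /botpages mapping is illustrative only.
                request.getRequestDispatcher("/botpages" + request.getServletPath())
                       .forward(req, res);
            }
        }

        @Override public void init(FilterConfig cfg) {}
        @Override public void destroy() {}
    }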
Final Note:
You only have so many resources, and not every legitimate bot is well behaved (i.e. some ignore robots.txt and put a lot of stress on your server). You will have to update your approach over time. For example, if one IP address turns out to be a custom search bot your client (or their client) made, you may have to resort to rate-limiting that IP address instead of blocking it completely.
Essentially you are trying to strike a good balance between serving your users and keeping your site available for search engines. Do enough to keep your site responsive to the users, and only resort to the more advanced tactics as necessary.
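If you do end up rate-limiting a single IP address as described above, a small fixed-window counter is often all it takes; the sketch below (names and limits are mine) allows at most maxRequests per time window per IP:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Per-IP fixed-window rate limiter: at most maxRequests per window.
    // Production code would also evict stale entries.
    public class IpRateLimiter {
        private final int maxRequests;
        private final long windowMillis;
        private final Map<String, Window> windows = new ConcurrentHashMap<>();

        public IpRateLimiter(int maxRequests, long windowMillis) {
            this.maxRequests = maxRequests;
            this.windowMillis = windowMillis;
        }

        public boolean allow(String ip) {
            long now = System.currentTimeMillis();
            Window w = windows.compute(ip, (key, old) ->
                    (old == null || now - old.start > windowMillis) ? new Window(now) : old);
            return w.count.incrementAndGet() <= maxRequests;
        }

        private static final class Window {
            final long start;
            final AtomicInteger count = new AtomicInteger();
            Window(long start) { this.start = start; }
        }
    }

For example, new IpRateLimiter(10, 1000) roughly corresponds to the 10/second threshold mentioned earlier.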
The normal approach to this is to configure a robots.txt file to allow the crawlers that you want and disallow the rest. Of course, this does depend on crawlers following the rules, but for those that don't you can fall back on things like user-agent strings, IP address checking, etc.
There are other nice things about "robots.txt" too.
As to whether spiders that don't accept cookies will be shut out: I believe so. See Google's view on what you are doing.
As to whether the reverse/forward DNS check will work: it probably will, but it is rather expensive. robots.txt is a simpler approach, and easy to implement in the first instance.
The correct and fast way to identify Googlebot is to combine both checks: a cheap user-agent check first, and the reverse/forward DNS verification of the IP address only for clients that claim to be Googlebot.
That way, only clients that identify as Googlebot pay the one-time price for the IP/DNS verification. Assuming that you will locally cache the result per IP for a while, of course.
For the user-agent check, you can use simple Java String functionality, something like userAgent.contains("Googlebot") according to https://support.google.com/webmasters/answer/1061943, or else you can use this library: https://github.com/before/uadetector
Regarding DNS, that's what Google recommends: https://support.google.com/webmasters/answer/80553
Bing works the same way with bingbot, see http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
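Since the DNS round trips are the expensive part, it is worth caching the verification result per IP, along these lines (my own sketch; it assumes a reverse/forward check like the BotVerifier sketch shown near the top of the page, and the one-day expiry is arbitrary):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Per-IP cache for the crawler verification result, so each IP only pays
    // the reverse/forward DNS cost once per expiry period. The get/put pair is
    // not strictly atomic, but that is good enough for this purpose.
    public class CachedCrawlerVerifier {
        private static final long EXPIRY_MILLIS = 24 * 60 * 60 * 1000L; // arbitrary: one day

        private final Map<String, CacheEntry> cache = new ConcurrentHashMap<>();

        public boolean isVerifiedCrawler(String ip) {
            long now = System.currentTimeMillis();
            CacheEntry entry = cache.get(ip);
            if (entry == null || now - entry.checkedAt > EXPIRY_MILLIS) {
                entry = new CacheEntry(verifyByDns(ip), now);
                cache.put(ip, entry);
            }
            return entry.verified;
        }

        // Delegates to the reverse-then-forward DNS check sketched earlier.
        private boolean verifyByDns(String ip) {
            return BotVerifier.isVerifiedCrawler(ip,
                    ".googlebot.com", ".google.com", ".search.msn.com");
        }

        private static final class CacheEntry {
            final boolean verified;
            final long checkedAt;
            CacheEntry(boolean verified, long checkedAt) {
                this.verified = verified;
                this.checkedAt = checkedAt;
            }
        }
    }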
Because I needed the same thing, I've put some Java code into a library and published it on GitHub: https://github.com/optimaize/webcrawler-verifier it's available from Maven Central. And here's a blog post describing it: http://www.flowstopper.org/2015/04/is-that-googlebot-user-agent-really-from-google.html
Check out this site:
http://www.user-agents.org/
They also have an XML version of the database that you can download and incorporate. They classify known "User-Agent" header identities by whether they are a Browser, Link/Server-Checker, Downloading tool, Proxy Server, Robot/Spider/Search Engine, or Spam/bad-bot.
NOTE
I have come across a couple of User-Agent strings that were just the Java runtime someone had hacked together to scrape a site. In that case it turned out someone was building their own search engine scraper, but it could just as well have been a spider downloading all your content for offsite/disconnected use.
Detecting GoogleBot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
Detecting Bing: https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26
Unfortunately, GitHub does not seem to turn up a ready-to-use solution: https://github.com/search?q=googlebot.com%20search.msn.com&type=repositories