Express.js / Node.js: serving different data to Google/etc. bots and human traffic
I want to determine whether an incoming request comes from a bot (e.g. Google, Bing) or from a human, and serve different data to each: for example, JSON data that client-side JavaScript uses to construct the site, or preprocessed HTML.
Using Express.js, is there an easy way to do this? Thanks.
Comments (3)
You can check req.header('User-Agent') for 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'. If it matches, you know the request is from Google and can send it different data.
http://www.google.com/support/webmasters/bin/answer.py?answer=1061943
How to get headers
http://expressjs.com/4x/api.html#req.get
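As a rough sketch of this approach (not the answerer's own code), the check can live in a small middleware. It assumes a substring match on "googlebot" or "bingbot" rather than a comparison against the full string, since crawler User-Agent values vary; the names BOT_PATTERN, isLikelyBot, detectBot and serveHome, and the response bodies, are made up for illustration.

```javascript
// Substrings that some well-known crawlers put in their User-Agent.
// Illustrative only, not an exhaustive list.
const BOT_PATTERN = /googlebot|bingbot/i;

// Returns true when the User-Agent string looks like a known crawler.
function isLikelyBot(userAgent) {
  return BOT_PATTERN.test(userAgent || '');
}

// Express-style middleware: flags the request and passes it on.
function detectBot(req, res, next) {
  req.isBot = isLikelyBot(req.get('User-Agent'));
  next();
}

// Hypothetical route handler: prerendered HTML for bots, JSON for browsers.
function serveHome(req, res) {
  if (req.isBot) {
    res.send('<html><body><h1>Prerendered page</h1></body></html>');
  } else {
    res.json({ title: 'Home', items: [] });
  }
}

// Wiring, assuming an Express app object named `app`:
//   app.use(detectBot);
//   app.get('/', serveHome);
```

Keeping the check in a plain function makes it easy to extend the pattern later or to test it without starting a server.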
I recommend responding according to the requested MIME type, which is given in the "Accept" header. You can do this with Express.
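A minimal sketch of content negotiation in Express (not the answerer's original code) could use res.format, which picks a handler based on the request's Accept header. The handler name servePage, the pageData object and the HTML string are assumptions for illustration.

```javascript
// Hypothetical page data shared by both representations.
const pageData = { title: 'Home', items: ['a', 'b'] };

// Route handler: Express's res.format chooses a callback from the Accept header.
function servePage(req, res) {
  res.format({
    // Clients asking for HTML get preprocessed markup.
    'text/html': () => {
      res.send('<html><body><h1>' + pageData.title + '</h1></body></html>');
    },
    // Client-side JavaScript asking for JSON gets the raw data.
    'application/json': () => {
      res.json(pageData);
    },
    // Anything else is rejected.
    default: () => {
      res.status(406).send('Not Acceptable');
    },
  });
}

// Wiring, assuming an Express app object named `app`:
//   app.get('/', servePage);
```

This does not distinguish bots from humans as such; it serves whatever representation the client asks for, so a crawler that requests HTML receives the HTML branch.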
Checking the User-Agent request header or the MIME type, as suggested, is not reliable, since any HTTP GET request can set User-Agent and other headers at will. The most reliable and secure approach is to check by IP.
Therefore I developed an NPM package that does exactly that. At startup it loads all known IP ranges of Google bots and crawlers into memory, for very fast middleware processing.
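The package itself is not named here, so the following is only a sketch of the general idea, not its implementation. It assumes a Node.js version that provides net.BlockList, and it uses placeholder documentation-only CIDR ranges (EXAMPLE_BOT_RANGES) instead of real crawler ranges; a real deployment would load the ranges the crawler operator publishes. The function names are hypothetical.

```javascript
const net = require('net');

// Placeholder ranges from documentation-only networks, NOT real Googlebot ranges.
const EXAMPLE_BOT_RANGES = ['192.0.2.0/24', '2001:db8::/32'];

// Builds an in-memory lookup structure once, at startup.
function buildRangeList(cidrs) {
  const list = new net.BlockList();
  for (const cidr of cidrs) {
    const [address, prefix] = cidr.split('/');
    const family = net.isIPv6(address) ? 'ipv6' : 'ipv4';
    list.addSubnet(address, Number(prefix), family);
  }
  return list;
}

const botRanges = buildRangeList(EXAMPLE_BOT_RANGES);

// Returns true when the address falls inside one of the configured ranges.
function isKnownBotIp(ip) {
  // IPv4 clients may be reported as IPv4-mapped IPv6, e.g. "::ffff:192.0.2.7".
  const address = ip.startsWith('::ffff:') ? ip.slice(7) : ip;
  if (net.isIPv4(address)) return botRanges.check(address, 'ipv4');
  if (net.isIPv6(address)) return botRanges.check(address, 'ipv6');
  return false;
}

// Express-style middleware: flags the request based on its source address.
function markBotByIp(req, res, next) {
  req.isBot = isKnownBotIp(req.ip || '');
  next();
}

// Wiring, assuming an Express app object named `app`:
//   app.use(markBotByIp);
```

Note that req.ip only reflects the real client address behind a reverse proxy if Express's "trust proxy" setting is configured accordingly; otherwise it reports the proxy's address.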