Tomcat 7 and Struts 1 - Handling a Large Volume of Google Bot Hits
More than half of the hits on one of my servers are from the Google Bot, constantly crawling our millions of pages.
The reason we have so many pages is that the company is an auto parts store, with unique URLs for every combination of manufacturer part number and the vehicles it fits. This isn't something we can get rid of; people search on these terms all the time, and we need unique landing pages for each one (because all of our competitors have them, of course!).
Thus, we have millions of pages that Google needs to know about. That means we're getting several hits per second from their crawler, round the clock, and this is traffic that's as vital and necessary as any end-user traffic.
Because we're constantly adding new products to the catalogue, on the order of hundreds of thousands per week, our list of unique URLs grows ever longer, and the traffic has been steadily increasing.
The Google bot doesn't pay any attention to cookies, which means it gets a new session on every request, so these sessions pile up and drive our memory usage to the maximum allocated.
How are others with Tomcat7 and Struts dealing with such massive automated traffic?
The method I plan to try is to invalidate the session at the end of each request, in the page footer JSP tile (if and only if the user-agent string identifies the Google crawler). Is this an effective technique for saving memory?
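A minimal sketch of the user-agent check behind that plan. The `BotDetector` class, its `isCrawlerRequest` helper, and the crawler tokens are my own illustrative names, not anything from the question; the commented-out snippet shows where the footer tile (or a servlet filter) would call it to invalidate the crawler's session:

```java
public class BotDetector {
    // Tokens that identify crawler user agents; extend as needed for your logs.
    private static final String[] CRAWLER_TOKENS = {"googlebot", "bingbot"};

    /** Returns true if the User-Agent header looks like a known crawler. */
    public static boolean isCrawlerRequest(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase();
        for (String token : CRAWLER_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // In the footer JSP tile (or a servlet filter running after the request),
    // the check would be used roughly like this:
    //
    //   if (BotDetector.isCrawlerRequest(request.getHeader("User-Agent"))) {
    //       HttpSession s = request.getSession(false); // don't create one
    //       if (s != null) {
    //           s.invalidate();                        // free the memory now
    //       }
    //   }
}
```

An alternative to invalidating outright is calling `session.setMaxInactiveInterval()` with a very short timeout for crawler requests, so Tomcat reclaims those sessions quickly without risking side effects on pages that still touch the session after the footer renders.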
What other strategies could help us handle bot traffic more effectively?
I'm not exactly in this field, but have you taken a look at:
http://www.robotstxt.org/
I believe it's a standard that Google should adhere to.
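For what it's worth, a robots.txt can't slow Google down directly (Googlebot ignores the `Crawl-delay` directive; Google's crawl rate is managed through Search Console instead), but it can stop the crawler from wasting visits on low-value duplicate URLs while leaving the unique landing pages indexable. A sketch, where the paths are purely hypothetical examples:

```
# Illustrative robots.txt sketch - the paths below are made-up examples.
User-agent: *
# Leave the part/vehicle landing pages crawlable (no Disallow for them),
# but block duplicate views that burn crawl budget:
Disallow: /search
Disallow: /*?sort=
```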