Find out when Google last crawled a page
I'd like to find out how current Google's cached copy of a large set of pages is. I think I need to
- look in the logs for IPs,
- check for the user-agent "googlebot", then
- export a list showing each page and when it was last visited.
I imagine this could be a cron job that runs weekly. If this is right, how would I write the script? If this is wrong, what would be a better way?
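The steps listed above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the web server writes the Apache/Nginx "combined" log format, and the names `LOG_LINE`, `last_googlebot_visits`, and `format_report` are hypothetical, introduced only for illustration.

```python
import re
from datetime import datetime

# Assumed layout (combined log format):
# ip ident user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def last_googlebot_visits(lines):
    """Return {path: datetime} for the most recent Googlebot request per path."""
    latest = {}
    for line in lines:
        match = LOG_LINE.match(line)
        # Skip lines that do not parse or whose user-agent is not Googlebot.
        if not match or "googlebot" not in match.group("agent").lower():
            continue
        seen = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        path = match.group("path")
        if path not in latest or seen > latest[path]:
            latest[path] = seen
    return latest


def format_report(latest):
    """Render one tab-separated 'path<TAB>ISO timestamp' line per page."""
    return "\n".join(
        f"{path}\t{when.isoformat()}" for path, when in sorted(latest.items())
    )
```

A weekly cron job could open the access log, pass the file object to `last_googlebot_visits`, and write the result of `format_report` to a file. Note that the user-agent string can be spoofed, so matching on it alone only approximates real Googlebot traffic; a reverse DNS lookup on the logged IP is the usual way to confirm it. A log entry also only shows when a page was fetched, which is not necessarily the date of the cached copy.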
Comments (3)
Google already provides this information via Google SiteMaps. I have used it for the past three years - works great.
Add your site to SiteMaps and put a generated SiteMap XML file for your site on your web server (search Google for websites that generate one for free), then let Google do the rest. There is a section in SiteMaps called Crawl Stats that gives you what you want.
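If you would rather generate the SiteMap XML yourself than use a third-party generator, a minimal sketch might look like this. It assumes the sitemaps.org 0.9 schema; the function name `build_sitemap` and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build sitemap XML from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            # lastmod is expected in W3C date format, e.g. "2024-01-31".
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

For example, `build_sitemap([("https://example.com/", "2024-01-31")])` returns a string that can be saved as `sitemap.xml` in the web root and then submitted to Google.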
That isn't necessary. You can make a service call to Google to look up the cached page, i.e. search for cache:stackoverflow.com; the result includes the time and date of the cached copy. I wouldn't be surprised if there's an API call to do this more directly (update: Google Search API).
The last Googlebot access can also be found for free via websites such as mypagerank.net or via the Google Toolbar.