Web log file analysis software for measuring search crawler activity


I need to analyze the search-engine crawling happening on my site. Is there a good tool for this? I've tried AWStats and Sawmill, but both give me very limited insight into the crawling. I need to know things like how many unique/distinct webpages in a section of my site were crawled by a specific crawler within a given time period.

Google Analytics doesn't track crawling at all, because crawlers never execute its JavaScript tracking code.
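
For the distinct-page counts that AWStats and Sawmill don't surface directly, you can pull them out of the raw access log yourself. Below is a minimal Python sketch, assuming an Apache combined-format log; the log path, site section, crawler substring, and date range are all illustrative assumptions to adjust for your site.

```python
# Minimal sketch: count distinct URLs in one site section crawled by one
# crawler within a date range, from an Apache combined-format access log.
# LOG_PATH, SECTION, UA_SUBSTRING, and the dates are assumptions.
import re
from datetime import datetime

LOG_PATH = "access.log"      # hypothetical path
SECTION = "/blog/"           # site section to measure
UA_SUBSTRING = "Googlebot"   # crawler to look for
START = datetime(2024, 9, 1)
END = datetime(2024, 9, 30)

# Combined format: IP ident user [time] "METHOD URL PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

unique_pages = set()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.match(line)
        if not m or UA_SUBSTRING not in m.group("ua"):
            continue
        # e.g. "25/Sep/2024:14:45:15 +0000" -> drop the timezone offset
        ts = datetime.strptime(m.group("time").split()[0], "%d/%b/%Y:%H:%M:%S")
        if START <= ts <= END and m.group("url").startswith(SECTION):
            unique_pages.add(m.group("url").split("?")[0])  # ignore query strings

print(f"{len(unique_pages)} distinct pages under {SECTION} crawled by {UA_SUBSTRING}")
```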


夕嗳→ 2024-10-02 14:45:15


Upon following a link to the first page of your site, the major search-engine crawlers will first request a file called robots.txt, which tells the crawler which pages the site owner permits it to visit and which files or directories are off limits.

What if you don't have a robots.txt? Nearly always, the crawler 'interprets' this to mean that no pages or directories are off limits, and it will proceed to crawl your entire site. So why include a robots.txt file at all if that's what you want, i.e., for the crawler to index your entire site? Because if the file is there, the crawler will nearly always request it so it can read it, and that request shows up as a line in your server access log file, which is a pretty strong signature for a crawler.
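
If you just want to see which clients are exhibiting that signature, a short script over the access log is enough. This is a sketch under the same combined-log-format assumption as above; the log path is hypothetical.

```python
# Minimal sketch: list the clients that requested /robots.txt, a strong
# crawler signature. Assumes an Apache combined-format access log.
import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<url>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

robots_clients = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.match(line)
        if m and m.group("url").startswith("/robots.txt"):
            robots_clients[(m.group("ip"), m.group("ua"))] += 1

# Most frequent robots.txt requesters first: very likely crawlers.
for (ip, ua), hits in robots_clients.most_common():
    print(f"{hits:5d}  {ip:15s}  {ua}")
```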

Second, a good server access-log parser such as Webalyzer or AWStats can compare user agents and IP addresses against published, authoritative lists: the IAB (http://www.iab.net/sites/spiders/login.php) and user-agents.org publish the two lists that seem to be the most widely used for this purpose. The former costs a few thousand dollars per year and up; the latter is free.
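
In the same spirit, once you have such a list you can tag log entries yourself. The sketch below uses a tiny, assumed sample of crawler substrings for illustration, not either published list.

```python
# Minimal sketch: flag user agents that match a known-crawler list, in the
# spirit of the user-agents.org data. These three substrings are assumed
# samples, not the full published list.
KNOWN_CRAWLER_SUBSTRINGS = ["Googlebot", "bingbot", "Slurp"]

def is_known_crawler(user_agent: str) -> bool:
    """Return True if the user-agent string matches any known crawler."""
    ua = user_agent.lower()
    return any(sig.lower() in ua for sig in KNOWN_CRAWLER_SUBSTRINGS)

# Example usage:
print(is_known_crawler(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # True
```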

Both Webalyzer and AWStats can do what you want, though I recommend AWStats for the following reasons: it was updated fairly recently (approximately one year ago), while Webalyzer was last updated over eight years ago. In addition, AWStats has much nicer report templates. The advantage of Webalyzer is that it is much faster.

Here's sample output from AWStats (based on the out-of-the-box config) that is probably what you are looking for:

[Screenshot: sample AWStats report]
