Counting page views while ignoring search engines?

Published 2024-07-04 23:53:48


I notice that StackOverflow has a views count for each question and that these view numbers are fairly low and accurate.

I have a similar thing on one of my sites. It basically logs a "hit" whenever the page is loaded in the backend code. Unfortunately it also does this for search engine hits giving bloated and inaccurate numbers.

I guess one way to not count a robot would be to do the view counting with an AJAX call once the page has loaded, but I'm sure there's other, better ways to ignore search engines in your hit counters whilst still letting them in to crawl your site. Do you know any?


Comments (6)

不顾 2024-07-11 23:53:48


An AJAX call will do it, but usually search engines will not load images, javascript or CSS files, so it may be easier to include one of those files in the page, and pass the URL of the page you want to log a request against as a parameter in the file request.

For example, in the page...

http://www.example.com/example.html

You might include in the head section

<link href="empty.css?log=example.html" rel="stylesheet" type="text/css" />

And have your server side log the request, then return an empty css file. The same approach would apply to a JavaScript or an image file, though in all cases you'll want to look carefully at what caching might take place.
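A minimal server-side sketch of that logging step, assuming a Node.js backend (the original answer is server-agnostic, so the handler shape here is illustrative):

```javascript
// Extracts the page to log from a stylesheet request such as
// "/empty.css?log=example.html". The base origin is only needed so the
// WHATWG URL parser accepts a path-relative request URL.
function pageToLog(requestUrl) {
  const url = new URL(requestUrl, "http://www.example.com");
  return url.searchParams.get("log"); // null when no log parameter is present
}

// In the actual request handler you would record the hit, then answer
// with an empty stylesheet and a no-cache header so repeat views are
// still counted:
//   res.writeHead(200, { "Content-Type": "text/css", "Cache-Control": "no-store" });
//   res.end("");
```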

Another option would be to eliminate the search engines based on their user agent. There's a big list of possible user agents at http://user-agents.org/ to get you started. Of course, you could go the other way, and only count requests from things you know are web browsers (covering IE, Firefox, Safari, Opera and this newfangled Chrome thing would get you 99% of the way there).
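The whitelist direction described above can be sketched as a simple substring check (the token list is illustrative and deliberately tiny; real user-agent strings are messier, and any client can fake its UA):

```javascript
// Count a hit only when the user agent contains a token from a known
// browser. Bots like Googlebot match none of these tokens, so their
// requests are skipped; this is a heuristic, not a guarantee.
const BROWSER_TOKENS = ["firefox", "chrome", "safari", "opera", "msie"];

function isKnownBrowser(userAgent) {
  const ua = (userAgent || "").toLowerCase();
  return BROWSER_TOKENS.some(token => ua.includes(token));
}
```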

Even easier would be to use a log analytics tool like awstats or a service like Google analytics, both of which have already solved this problem.

塔塔猫 2024-07-11 23:53:48


The reason Stack Overflow has accurate view counts is that it only counts each view/user once.

Third-party hit counter (and web statistics) applications often filter out search engines and display them in a separate window/tab/section.
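That once-per-user rule can be sketched with an in-memory set of (user, page) pairs; a real site would back this with a database or an expiring cache, and the user key (cookie, hashed IP, etc.) is an assumption here:

```javascript
// Counts a view only the first time a given user key sees a given page.
class ViewCounter {
  constructor() {
    this.seen = new Set();   // "userKey|pageId" pairs already counted
    this.counts = new Map(); // pageId -> distinct-view count
  }
  recordView(userKey, pageId) {
    const pair = `${userKey}|${pageId}`;
    if (this.seen.has(pair)) return false; // repeat view: not counted
    this.seen.add(pair);
    this.counts.set(pageId, (this.counts.get(pageId) || 0) + 1);
    return true;
  }
  views(pageId) {
    return this.counts.get(pageId) || 0;
  }
}
```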

给妤﹃绝世温柔 2024-07-11 23:53:48


You are either going to have to do what you said in your question with AJAX, or exclude User-Agent strings that are known search engines. The only sure way to stop bots is with AJAX.

动次打次papapa 2024-07-11 23:53:48


An extension to Matt Sheppard's answer might be something like the following:

  <script type="text/javascript">
  var thePg=window.location.pathname;
  var theSite=window.location.hostname;
  var theImage=new Image();
  theImage.src="/test/hitcounter.php?pg=" + thePg + "&site=" + theSite;
  </script>

which can be plugged into a page header or footer template without needing to substitute the page name server-side. Note that if you include the query string (window.location.search), a robust version of this should encode the string to prevent evildoers from crafting page requests that exploit vulnerabilities based on weird stuff in URLs. The nice thing about this vs. a regular <img> tag or <iframe> is that the user won't see a red x if there is a problem with the hitcounter script.

In some cases, it's also important to know the URL that was seen by the browser, before rewrites, etc. that happen server-side, and this gives you that. If you want it both ways, then add another parameter server-side that inserts that version of the page name into the query string as well.

An example of the log files from a test of this page:

10.1.1.17 - - [13/Sep/2008:22:21:00 -0400] "GET /test/testpage.html HTTP/1.1" 200 306 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16"
10.1.1.17 - - [13/Sep/2008:22:21:00 -0400] "GET /test/hitcounter.php?pg=/test/testpage.html&site=www.home.***.com HTTP/1.1" 301 - "http://www.home.***.com/test/testpage.html" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16"
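The encoding caveat mentioned above can be sketched like this; it builds the same beacon URL but percent-encodes each value (hitcounter.php is the answer's own hypothetical endpoint):

```javascript
// Builds the hit-counter URL with properly separated, percent-encoded
// parameters, so odd characters in the page URL can't smuggle in extra
// query parameters or break the log format.
function buildBeaconUrl(pathname, hostname, search) {
  const params = new URLSearchParams({
    pg: pathname + (search || ""), // optionally include the query string
    site: hostname
  });
  return "/test/hitcounter.php?" + params.toString();
}
```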
孤君无依 2024-07-11 23:53:48


You don't really need to use AJAX, just use JavaScript to add an iFrame off screen. KEEP IT SIMPLE

<script type="text/javascript">
document.write('<iframe src="myLogScript.php" style="visibility:hidden" width="1" height="1" frameborder="0"></iframe>');
</script>
欢烬 2024-07-11 23:53:48


To solve this problem I implemented a simple filter that would look at the User-Agent header in the HTTP request and compare it to a list of known robots.

I got the robot list from www.robotstxt.org. It's downloadable in a simple text format that can easily be parsed to auto-generate the "blacklist".
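A sketch of such a filter; note the parsing assumes a simplified one-pattern-per-line blacklist file (the actual robotstxt.org database uses a richer field-based format, so real parsing would differ):

```javascript
// Parses a blacklist file (assumed format: one user-agent substring per
// line, "#" comments allowed) and returns a matcher for incoming requests.
function makeBotFilter(blacklistText) {
  const patterns = blacklistText
    .split("\n")
    .map(line => line.trim().toLowerCase())
    .filter(line => line && !line.startsWith("#"));
  return userAgent => {
    const ua = (userAgent || "").toLowerCase();
    return patterns.some(p => ua.includes(p));
  };
}
```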
