Basic site analytics not matching Google Analytics data
After being stumped by an earlier question: SO google-analytics-domain-data-without-filtering
I've been experimenting with a very basic analytics system of my own.
MySQL table:
hit_id, subsite_id, timestamp, ip, url
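
For reference, a minimal MySQL definition of such a table might look like the sketch below; only the column names come from the question, while the table name, types, lengths and index are my own assumptions:

    CREATE TABLE hits (
        hit_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        subsite_id  INT UNSIGNED NOT NULL,
        `timestamp` DATETIME NOT NULL,
        ip          VARCHAR(45) NOT NULL,       -- 45 chars also fits IPv6 addresses
        url         VARCHAR(1024) NOT NULL,
        KEY idx_subsite_time (subsite_id, `timestamp`)
    ) ENGINE=InnoDB;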
The subsite_id lets me drill down to a folder (as explained in the previous question).
I can now get the following metrics:
- Page Views - Grouped by subsite_id and date
- Unique Page Views - Grouped by subsite_id, date, url, IP (not nesecarily how Google does it!)
- The usual "most visited page", "likely time to visit" etc etc.
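
The first two metrics could be computed roughly as follows (a sketch against the assumed table above; the grouping simply mirrors the bullets, not necessarily Google's definitions):

    -- Page views per subsite per day
    SELECT subsite_id, DATE(`timestamp`) AS day, COUNT(*) AS page_views
    FROM hits
    GROUP BY subsite_id, DATE(`timestamp`);

    -- "Unique" page views: distinct (url, ip) combinations per subsite per day
    SELECT subsite_id, DATE(`timestamp`) AS day,
           COUNT(DISTINCT url, ip) AS unique_page_views
    FROM hits
    GROUP BY subsite_id, DATE(`timestamp`);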
I've now compared my data to that in Google Analytics and found that Google has lower values for each metric, i.e. my own setup is counting more hits than Google.
So I've started discounting IPs from various web crawlers; Google, Yahoo & Dotbot so far.
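
One way to apply that discounting at query time is to keep the crawler addresses in their own table and exclude them with an anti-join; the bot_ips table here is purely illustrative, not part of the setup described above:

    -- Page views per subsite per day, ignoring known crawler IPs
    SELECT h.subsite_id, DATE(h.`timestamp`) AS day, COUNT(*) AS page_views
    FROM hits h
    LEFT JOIN bot_ips b ON b.ip = h.ip
    WHERE b.ip IS NULL
    GROUP BY h.subsite_id, DATE(h.`timestamp`);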
Short questions:
- Is it worth me collating a list of all major crawlers to discount, and is any such list likely to change regularly?
- Are there any other obvious filters that Google will be applying to GA data?
- What other data would you collect that might be of use further down the line?
- What variables does Google use to work out entrance search keywords to a site?
The data is only going to be used internally for our own "subsite ranking system", but I would like to show my users some basic data (page views, most popular pages etc.) for their reference.
Comments (3)
Lots of people block Google Analytics for privacy reasons.
Under-reporting by the client-side rig versus server-side seems to be the usual outcome of these comparisons.

Here's how I've tried to reconcile the disparity when I've come across these studies.

Data sources recorded in server-side collection but not client-side:

- hits from mobile devices that don't support JavaScript (this is probably a significant source of disparity between the two collection techniques; e.g., a Jan '07 comScore study showed that 19% of UK Internet users access the Internet from a mobile device)
- hits from spiders and bots (which you mentioned already)

Data sources/events that server-side collection tends to record with greater fidelity (far fewer false negatives) compared with JavaScript page tags:

- hits from users behind firewalls, particularly corporate firewalls; firewalls block page tags, plus some are configured to reject/delete cookies.
- hits from users who have disabled JavaScript in their browsers (five percent, according to W3C data).
- hits from users who exit the page before it loads. Again, this is a larger source of disparity than you might think. The most frequently cited study supporting this was conducted by Stone Temple Consulting, which showed that the difference in unique visitor traffic between two identical sites configured with the same web analytics system, differing only in whether the JS tracking code was placed at the bottom of the pages on one site or at the top of the pages on the other, was 4.3%.

FWIW, here's the scheme I use to remove/identify spiders, bots, etc. (see the SQL sketch after this list):

- monitor requests for our robots.txt file, then of course filter all other requests from the same IP address + user agent (not all spiders will request robots.txt, of course, but with minuscule error, any request for this resource is probably a bot).
- compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose.
- pattern analysis: nothing sophisticated here; we look at (i) page views as a function of time (i.e., clicking a lot of links with 200 ms on each page is probative); (ii) the path by which the 'user' traverses our site, and whether it is systematic and complete or nearly so (like following a back-tracking algorithm); and (iii) precisely-timed visits (e.g., 3 am each day).
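
If it helps, the robots.txt and pattern checks above map fairly directly onto queries over a hit table like the one in the question. This is only a sketch: it assumes a user_agent column has been added to the log (the original table doesn't have one), and the known_bot_agents table and the 60-hits-per-minute threshold are illustrative, not prescriptive:

    -- 1. IPs that have requested robots.txt are almost certainly bots
    SELECT DISTINCT ip
    FROM hits
    WHERE url = '/robots.txt';

    -- 2. User agents matching fragments loaded from a published list
    --    (e.g. user-agents.org); assumes hits.user_agent exists
    SELECT DISTINCT h.ip, h.user_agent
    FROM hits h
    JOIN known_bot_agents k
      ON h.user_agent LIKE CONCAT('%', k.agent_fragment, '%');

    -- 3. Crude pattern check: implausibly many page views in a single minute
    SELECT ip,
           DATE_FORMAT(`timestamp`, '%Y-%m-%d %H:%i') AS minute,
           COUNT(*) AS hits_in_minute
    FROM hits
    GROUP BY ip, DATE_FORMAT(`timestamp`, '%Y-%m-%d %H:%i')
    HAVING COUNT(*) > 60;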
The biggest reasons are that users have to have JavaScript enabled and load the entire page, since the tracking code is often in the footer. AWStats and other server-side solutions like yours will capture everything. Plus, Analytics does a really good job of identifying bots and scrapers.