How do I extract data from Google Analytics and build a data warehouse (webhouse) from it?

Posted on 2024-09-02 10:50:47

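I have clickstream data in Google Analytics, such as referring URLs, top landing pages, and top exit pages, along with metrics such as page views, number of visits, and bounce rate. There is not yet a database that stores all of this information. I need to build a data warehouse from scratch out of this data (I believe this is called a webhouse). So I need to extract the data from Google Analytics and load it into the warehouse automatically every day. My questions are: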

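1) Is it possible? The data grows every day (some of it as metrics or measures, such as visits, and some as new referring sites), so how would the process of loading the warehouse work?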

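2) Which ETL tool would help me achieve this? I believe Pentaho has a way to pull data out of Google Analytics; has anyone used it? How does that process work? Besides answers, any references or links would be appreciated.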

我最亲爱的 2024-09-09 10:50:47

As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.

There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. This gives you data close to the form in which it appears in the GA Report, rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple: parsing the data from the XML container is about all that's needed.
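As a rough illustration of how little parsing the API route needs, the sketch below flattens dimension/metric pairs out of a small hand-written XML sample. The sample feed, its values, the namespace URI, and the element and attribute names are all assumptions made for illustration; check them against the response your API version actually returns.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an Atom feed. The namespace URI and the
# dxp:dimension / dxp:metric element names are assumptions for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dxp="http://schemas.google.com/analytics/2009">
  <entry>
    <dxp:dimension name="ga:source" value="lindenlab.com"/>
    <dxp:metric name="ga:visits" value="12"/>
    <dxp:metric name="ga:pageviews" value="30"/>
  </entry>
  <entry>
    <dxp:dimension name="ga:source" value="(direct)"/>
    <dxp:metric name="ga:visits" value="5"/>
    <dxp:metric name="ga:pageviews" value="9"/>
  </entry>
</feed>"""

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "dxp": "http://schemas.google.com/analytics/2009",
}


def feed_to_rows(xml_text):
    """Flatten each <entry> into one dict of dimension and metric values."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", NS):
        row = {}
        for dim in entry.findall("dxp:dimension", NS):
            row[dim.get("name")] = dim.get("value")
        for met in entry.findall("dxp:metric", NS):
            row[met.get("name")] = int(met.get("value"))
        rows.append(row)
    return rows


rows = feed_to_rows(SAMPLE_FEED)
```

Each resulting dict is already close to one row of a report-level fact table, which is why this route needs so little transformation.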

The second option involves grabbing the data much closer to the source.

Nothing complicated, but a few lines of background are perhaps helpful here.

The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data that corresponds to one Profile in one Account). Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client. Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from the _trackPageview function call, plus data from the client DOM, the GA cookies set for this client, and the contents of the Browser's location bar (http://www....). Though this Request is from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_trackPageview).

So working directly with this transaction data is probably the most natural way to build a Data Warehouse; another advantage is that you avoid the additional overhead of an intermediate API.

The individual lines of the GA log are not normally available to GA users. Still, it's simple to get them. These two steps should suffice:

First, modify the GA tracking code on each page of your Site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server. Specifically, immediately before the call to _trackPageview(), add this line:
```
pageTracker._setLocalRemoteServerMode();
```
Next, just put a single-pixel GIF image in your document root and call it "__utm.gif".

So now your server activity log will contain these individual transaction lines, again built from a string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User Agent string). The former string is just a concatenation of key-value pairs; each key begins with the letters "utm" (probably for "Urchin tracker"). Not every utm parameter appears in every GIF Request; several of them, for instance, are used only for e-commerce transactions. It depends on the transaction.
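To make this concrete, here is a minimal sketch of pulling the tracking-pixel requests out of a server log. It assumes an Apache-style combined log format, and the two sample log lines are hand-written for illustration; adjust the pattern to whatever your web server actually writes.

```python
import re

# Matches the request portion of a combined-log-format line, e.g.
# "GET /__utm.gif?utmwv=1&... HTTP/1.1". The log format is an assumption.
UTM_REQUEST = re.compile(r'"GET (/__utm\.gif\?[^ "]*) HTTP/[\d.]+"')


def utm_query_strings(log_lines):
    """Yield the query string of every __utm.gif request found in the log."""
    for line in log_lines:
        match = UTM_REQUEST.search(line)
        if match:
            # Drop the "/__utm.gif?" prefix and keep the key-value string.
            yield match.group(1).split("?", 1)[1]


# Hand-written sample lines: one tracking-pixel hit, one unrelated request.
sample_log = [
    '203.0.113.7 - - [19/May/2010:08:00:51 +0000] '
    '"GET /__utm.gif?utmwv=1&utmn=1669045322&utmhn=lindenlab.hrmdirect.com '
    'HTTP/1.1" 200 35 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [19/May/2010:08:00:52 +0000] '
    '"GET /favicon.ico HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
]

queries = list(utm_query_strings(sample_log))
```

Only the first sample line survives the filter; everything else in the log is ignored.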

Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):

http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B

As you can see, this string is composed of a set of key-value pairs, each separated by an "&". Two trivial steps make this much easier to read: (i) splitting the string on the ampersand; and (ii) replacing each GIF parameter (key) with a short descriptive phrase:

gatc_version 1

GIF_req_unique_id 1669045322

language_encoding UTF-8

screen_resolution 1280x800

screen_color_depth 24-bit

browser_language en-us

java_enabled 1

flash_version 10.0%20r45

campaign_session_new 1

page_title Position%20Listings%20%7C%20Linden%20Lab

host_name lindenlab.hrmdirect.com

referral_url http://lindenlab.com/employment

page_request /employment/openings.php?sort=da

account_string UA-XXXXXX-X

cookies __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
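The two steps can be sketched in a few lines. The key-to-label mapping below is read off the listing above; the function name and the shortened sample query string are illustrative, and values are deliberately left percent-encoded, as in the listing.

```python
# Descriptive names for the utm keys, taken from the listing above.
UTM_LABELS = {
    "utmwv": "gatc_version",
    "utmn": "GIF_req_unique_id",
    "utmcs": "language_encoding",
    "utmsr": "screen_resolution",
    "utmsc": "screen_color_depth",
    "utmul": "browser_language",
    "utmje": "java_enabled",
    "utmfl": "flash_version",
    "utmcn": "campaign_session_new",
    "utmdt": "page_title",
    "utmhn": "host_name",
    "utmr": "referral_url",
    "utmp": "page_request",
    "utmac": "account_string",
    "utmcc": "cookies",
}


def parse_gif_request(query):
    """Split a __utm.gif query string on '&' and relabel each key."""
    record = {}
    for pair in query.split("&"):
        if "=" not in pair:
            continue  # skip empty fragments such as the one left by "&&"
        # Split on the first "=" only, since values may contain "=" too.
        key, value = pair.split("=", 1)
        record[UTM_LABELS.get(key, key)] = value
    return record


# Shortened version of the request shown earlier.
query = (
    "utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800"
    "&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X"
)
record = parse_gif_request(query)
```

Unknown keys are kept under their raw utm name, so parameters that only appear in some requests are not silently dropped. `urllib.parse.unquote` can be applied to individual values when decoded text is wanted.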

The cookies are also simple to parse (see Google's concise description of them); for instance,

__utma is the unique-visitor cookie,
__utmb, __utmc are session cookies, and
__utmz is the referral type.

The GA cookies store the majority of the data that record each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the Site, a subsequent visit the next day, etc.). So for instance, the __utma cookie is composed of groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).
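A minimal sketch of that cookie parsing, using the utmcc value from the request shown earlier. The function name is illustrative, and only the visit-count position stated above is relied on; the meaning of the other __utma groups is not assumed here.

```python
from urllib.parse import unquote

# The utmcc value from the example request, still percent-encoded.
UTMCC = (
    "__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1"
    "%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B"
    "__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)"
    "%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment"
    "%7Cutmcmd%3Dreferral%3B%2B"
)


def parse_cookies(utmcc):
    """Decode the utmcc value and return {cookie_name: raw_value}."""
    cookies = {}
    # After decoding, cookies are separated by ";" followed by a "+".
    for chunk in unquote(utmcc).split(";"):
        chunk = chunk.strip("+ ")
        if chunk:
            name, value = chunk.split("=", 1)
            cookies[name] = value
    return cookies


cookies = parse_cookies(UTMCC)
# Per the text, the last dot-separated group of __utma is the visit count.
visit_count = int(cookies["__utma"].split(".")[-1])
```

The `__utmz` value keeps its own `|`-separated fields (utmccn, utmcsr, utmcct, utmcmd), which can be split the same way if the referral details are needed as separate columns.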

娇女薄笑 2024-09-09 10:50:47

You can use the Data Export API from Google, or a service such as the one we've built specifically for your need: www.analyticspros.com/products/analytics-data-warehouse.html.

Best,

-Caleb Whitmore
www.analyticspros.com / www.analyticsformarketers.com

习惯那些不曾习惯的习惯 2024-09-09 10:50:47

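As Shiva said, you can always pull GA data through the Google APIs and warehouse it yourself. However, if you are looking for a cost-efficient warehousing tool, try Analytics Canvas @ http://www.analyticscanvas.com/

You can also check out Google's App Gallery for tools related to Google Analytics: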
http://www.google.com/analytics/apps/

凉栀 2024-09-09 10:50:47

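You can always pull the GA (Google Analytics) data through their API and build your own data warehouse (DW). Before you start, you may want to sit down with the business users and get a clear understanding of the business requirements. In a DW environment it is extremely important to have a clear set of goals and to understand the business users' requirements, because you will be maintaining a history of transactions that lives for a long time and is used often.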

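Assuming the business users have defined the KPIs (Key Performance Indicators), metrics, dimensions, and granularity needed to proceed, you can check the different dimensions and metrics that are available through the GA API at code.google.com/apis/analytics/docs/. Then it is just a matter of making the right API calls and getting what you need. DW activity involves data cleaning, extraction, transformation, and loading (ETL) or ELT, along with summarizing facts across different dimensions. Since the data is much cleaner than what one would encounter in disparate systems (web logs, external vendors, Excel or flat files, etc.), you can simply load it through any ETL tool (for example Talend, Pentaho, SSIS, etc.) or through an application in the language of your choice (Perl, Java, Ruby, C#, etc.).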

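For the daily load, you need to design an incremental load process that runs during low user traffic (a nightly load), pulls only recent data, removes duplicates, cleans any non-conforming data, handles erroneous rows, and so on.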
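One way such an incremental load could look is sketched below, using an in-memory SQLite database as a stand-in for the warehouse. The table name, column names, and sample batches are hypothetical; the point is only that a primary key on (date, dimension) lets a re-run replace rows instead of duplicating them, and that malformed rows are set aside rather than loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
conn.execute(
    """CREATE TABLE daily_visits (
           visit_date TEXT NOT NULL,
           source     TEXT NOT NULL,
           visits     INTEGER NOT NULL,
           PRIMARY KEY (visit_date, source)
       )"""
)


def load_increment(conn, rows):
    """Load one batch. Re-loading the same (date, source) replaces the
    earlier row instead of duplicating it. Rows with bad counts are
    skipped and returned so they can be inspected."""
    rejected = []
    for visit_date, source, visits in rows:
        if not isinstance(visits, int) or visits < 0:
            rejected.append((visit_date, source, visits))
            continue
        conn.execute(
            "INSERT OR REPLACE INTO daily_visits VALUES (?, ?, ?)",
            (visit_date, source, visits),
        )
    conn.commit()
    return rejected


# Made-up batches: the second repeats a row from the first and has a bad row.
load_increment(conn, [("2010-05-19", "lindenlab.com", 12)])
bad = load_increment(conn, [
    ("2010-05-19", "lindenlab.com", 12),  # duplicate of the earlier load
    ("2010-05-20", "lindenlab.com", 7),
    ("2010-05-20", "(direct)", None),     # malformed, rejected
])
total_rows = conn.execute("SELECT COUNT(*) FROM daily_visits").fetchone()[0]
```

In a real nightly job the batch would come from the API call or the parsed log for the most recent day only, and the rejected rows would be written to an error table or log for follow-up.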

I have provided a sample GA API application at http://www.hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html, which will give you basic information to get started.
