如何从Google Analytics中提取数据并从中构建数据仓库(webhouse)?

发布于 2024-09-02 10:50:47 字数 357 浏览 5 评论 0原文

我在 Google Analytics 中拥有点击流数据,例如引荐 URL、热门登陆页面、热门退出页面,以及页面浏览量、访问次数、跳出率等指标。目前还没有可以存储所有这些信息的数据库。我需要从这些数据中从头开始构建一个数据仓库(我认为这被称为网络房屋)。因此,我需要从 Google Analytics 中提取数据并将其每天自动加载到仓库中。我的问题是:-

1)可能吗?每天的数据都在增加(有些是指标或措施,例如访问量,有些是新的推荐网站),加载仓库的过程将如何进行?

2)什么ETL工具可以帮助我实现这一目标?我相信Pentaho有办法从Google Analytics中提取数据,有人用过吗?这个过程是怎样进行的? 除了答案之外,任何参考文献、链接将不胜感激。

I have click stream data such as referring URL, top landing pages, top exit pages and metrics such as page views, number of visits, bounces all in Google Analytics. There is no database yet where all this information might be stored. I am required to build a data warehouse from scratch(which I believe is known as web-house) from this data.So I need to extract data from Google Analytics and load it into a warehouse on a daily automated basis. My questions are:-

1)Is it possible? Every day data increases (some in terms of metrics or measures such as visits and some in terms of new referring sites), how would the process of loading the warehouse go about?

2)What ETL tool would help me to achieve this? Pentaho I believe has a way to pull out data from Google Analytics, has anyone used it? How does that process go?
Any references, links would be appreciated besides answers.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

我最亲爱的 2024-09-09 10:50:47

与往常一样,了解底层事务数据的结构(用于构建 DW 的原子组件)是第一步,也是最重要的一步。

根据您检索数据的方式,基本上有两种选择。其中之一是通过 GA API 访问您的 GA 数据,这一点已在该问题的先前答案中提到。这与数据在 GA 报告中显示的形式非常接近,而不是事务数据。使用它作为数据源的优点是“ETL”非常简单,只需解析 XML 容器中的数据即可。

第二种选择涉及获取更接近源的数据。

没有什么复杂的,不过,几行背景知识在这里可能会有帮助。

  • GA Web 仪表板由以下人员创建
    解析/过滤 GA 事务日志
    (容器
    保存的 GA 数据
    对应一个配置文件合一
    帐户)。

  • 此日志中的每一行代表一个
    单笔交易并交付
    以 GA 服务器的形式
    来自客户端的 HTTP 请求

  • 附加到该请求(即
    名义上对于单像素 GIF)是
    包含所有的单个字符串
    从那里返回的数据
    _TrackPageview 函数调用以及来自客户端 DOM 的数据、GA Cookie
    为该客户端设置,并且
    浏览器位置的内容
    bar (http://www....)。

  • 虽然这个请求来自
    客户端,由GA调用
    脚本(驻留在客户端)
    执行 GA 的主程序后立即
    数据采集​​功能
    (_TrackPageview)。

因此,直接处理这些交易数据可能是构建数据仓库最自然的方式;另一个优点是可以避免中间 API 的额外开销)。

GA 用户通常无法获取 GA 日志的各个行。不过,获得它们仍然很简单。这两个步骤就足够了:

  1. 修改网站每个页面上的 GA 跟踪代码,以便
    发送每个 GIF 请求的副本
    (GA 日志文件中的一行)到您的
    自己的服务器,具体来说,
    立即调用
    _trackPageview(),添加此行:

    pageTracker._setLocalRemoteServerMode();
    
  2. 接下来,只需放置一个单像素 gif
    文档根目录中的图像并调用
    它是“__utm.gif”

因此,现在您的服务器活动日志将包含这些单独的交易行,同样是根据附加到 GA 跟踪像素的 HTTP 请求的字符串以及请求中的其他数据(例如,用户代理字符串)构建的。前一个字符串只是键值对的串联,每个键都以字母“utm”开头(可能是“urching tracker”)。并非每个 utm 参数都会出现在每个 GIF 请求中,例如,其中一些参数仅用于电子商务交易 - 这取决于交易。

这是一个实际的 GIF 请求(帐户 ID 已被清理,否则它是完整的):

http://www.google- analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24 位&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn= 1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&& ;utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B %2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(转介)% 7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B

正如您所看到的,该字符串由一组键值对组成,每个键值对之间用“&”分隔。只需两个简单的步骤:(i) 在 & 符号上拆分该字符串; (ii) 将每个 gif 参数(键)替换为简短的描述性短语,使其更易于阅读:

gatc_version 1

GIF_req_unique_id 1669045322

language_encoding UTF-8        

屏幕分辨率        1280x800

屏幕颜色深度       24位   

browser_language          en-us     

java_enabled            1          ;

flash_version            10.0%20r45

campaign_session_new      ; 1        

page_title              Position%20Listings%20%7C%20Linden%20Lab

< strong>host_name lindenlab.hrmdirect.com

referral_url        http://lindenlab.com/employment

page_request             /employment/openings.php?sort=da

account_string           UA-XXXXXX-X

cookie __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D870 45125.1274256051.1.1.utmccn%3D(转介)%7Cutmcsr%3Dlindenlab.com% 7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B

这些 cookies 也很容易解析(请参阅 Google 的简明描述 此处):例如,

  • __utma 是唯一访问者 cookie,

  • __utmb、__utmc 是会话 cookie,

  • __utmz< /strong> 是引荐类型。

GA cookie 存储记录用户每次交互的大部分数据(例如,单击标记的下载链接、单击网站上另一个页面的链接、第二天的后续访问等)。例如,__utma cookie 由一组整数组成,每组由“.”分隔;最后一组是该用户的访问计数(在本例中为“1”)。

As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.

There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. This is pretty close to the form that the data appears in the GA Report, rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple, just parsing the data from the XML container is about all that's needed.

The second option involves grabbing the data much closer to the source.

Nothing complicated, still, a few lines of background are perhaps helpful here.

  • The GA Web Dashboard is created by
    parsing/filtering a GA transaction log
    (the container
    that holds the GA data that
    corresponds to one Profile in one
    Account).

  • Each line in this log represents a
    single transaction and is delivered
    to the GA server in the form of an
    HTTP Request from the client.

  • Appended to that Request (which is
    nominally for a single-pixel GIF) is
    a single string that contains all of
    the data returned from that
    _TrackPageview function call plus data from the client DOM, GA cookies
    set for this client, and the
    contents of the Browser's location
    bar (http://www....).

  • Though this Request is from the
    client, it is invoked by the GA
    script (which resides on the client)
    immediately after execution of GA's primary
    data-collecting function
    (_TrackPageview).

So working directly with this transaction data is probably the most natural way to build a Data Warehouse; another advantage is that you avoid the additional overhead of an intermediate API).

The individual lines of the GA log are not normally avaialble to GA users. Still, it's simple to get them. These two steps should suffice:

  1. modify the GA tracking code on each page of your Site so that it
    sends a copy of each GIF Request
    (one line in the GA logfile) to your
    own server, specifically,
    immeidately before the call to
    _trackPageview(), add this line:

    pageTracker._setLocalRemoteServerMode();
    
  2. Next, just put a single-pixel gif
    image in your document root and call
    it "__utm.gif"
    .

So now your server activity log will contain these individual transction lines, again built from a string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User Agent string). This former string is just a concatenation of key-value pairs, each key begins with the letters "utm" (probably for "urching tracker"). Not every utm parameter appears in every GIF Request, several of them, for instance, are used only for e-commerce transactions--it depends on the transaction.

Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):

http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B

As you can see, this string is comprised of a set of key-value pairs each separated by an "&". Just two trivial steps: (i) Splitting this string on the ampersand; and (ii) replacing each gif parameter (key) with a short descriptive phrase, make this much easier to read:

gatc_version 1

GIF_req_unique_id 1669045322

language_encoding UTF-8     

screen_resolution         1280x800  

screen_color_depth        24-bit    

browser_language          en-us     

java_enabled              1         

flash_version             10.0%20r45

campaign_session_new      1         

page_title                Position%20Listings%20%7C%20Linden%20Lab

host_name lindenlab.hrmdirect.com

referral_url        http://lindenlab.com/employment

page_request              /employment/openings.php?sort=da

account_string            UA-XXXXXX-X

cookies __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B

The cookies are also simple to parse (see Google's concise description here): for instance,

  • __utma is the unique-visitor cookie,

  • __utmb, __utmc are session cookies, and

  • __utmz is the referral type.

The GA cookies store the majority of the data that record each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the Site, subsequent visit the next day, etc.). So for instance, the __utma cookie is comprised of a groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).

娇女薄笑 2024-09-09 10:50:47

您可以使用 Google 或服务的数据导出 API例如我们专门根据您的需求构建的:www.analyticspros.com/products/analytics-data-warehouse.html。

最佳,

-卡莱布·惠特莫尔
www.analyticspros.com / www.analyticsformarketers.com

You can use the Data Export API from Google or a service such as the one we've built specifically for your need: www.analyticspros.com/products/analytics-data-warehouse.html.

Best,

-Caleb Whitmore
www.analyticspros.com / www.analyticsformarketers.com

正如 Shiva 所说,您始终可以通过 Google API 提取 GA 数据并自行存储。但是,如果您正在寻找经济高效的仓储工具,请尝试 Analytics Canvas @ http://www.analyticscanvas.com/

您还可以查看 Google 的应用程序库以获取与 Google Analytics(分析)相关的工具:
http://www.google.com/analytics/apps/

As Shiva said, you can always pull GA data through Google APIs and warehouse it yourself. However, if you are looking for a cost efficient warehousing tool, try out Analytics Canvas @ http://www.analyticscanvas.com/

You could also check out Google's App gallery for Google Analytics related tools:
http://www.google.com/analytics/apps/

凉栀 2024-09-09 10:50:47

您可以随时通过他们的 API 提取 GA(Google Analytics)数据并构建您自己的数据仓库(DW)。在开始之前,您可能需要与业务用户坐在一起并清楚地了解业务需求。在 DW 环境中,有一个明确的目标并了解业务用户的需求非常重要,因为您将维护长期存在且经常使用的事务历史记录。

假设业务用户定义了继续操作所需的 KPI(关键绩效指标)、指标、维度、粒度,您可以通过 GA API(位于 code.google.com/apis/analytics/docs/)检查可用的不同维度和指标。然后,只需进行正确的 API 调用并获得您需要的内容即可。 DW 活动涉及数据清理、提取、转换和加载 (ETL) 或 ELT,以及沿不同维度总结事实。由于数据比不同系统中遇到的数据(来自 Web 日志、外部供应商、Excel 或文件等)要干净得多,因此您可以通过任何 ETL 工具(例如 Talend、Pentaho、SSIS 等)轻松加载数据。 )或通过您选择的应用程序(Perl、Java、Ruby、C# 等)。

对于日常负载,您需要在低用户流量时间(夜间负载)设计增量加载流程,仅拉取最近的数据,删除重复数据,清理任何不符合要求的数据,处理错误行等。

我已经提供了示例 GA API 应用程序位于 http:// www.hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html,它将为您提供基本的入门信息。

You can pull the GA (Google Analytics) data anytime through their API and build your own data warehouse (DW). Before you start, you may want to sit with the business user and get clear understanding of business requirements. In DW environment it is extremely important to have a clear set goals and understanding of business user requirements due to fact that you will maintain history of transactions which live for long time and used often.

Assuming that business user defines KPI (Key Performance Indicators), metrics, dimensions, granularity needed for you to proceed, you can check different dimensions and metrics that are availabe through GA API at code.google.com/apis/analytics/docs/. Then it is just a matter of making the right API call and getting what you need. DW activity involves data cleaning, extraction, transformation and loading (ETL) or ELT along with summarizing the facts along different dimensions. Since data is much cleaner than one would encounter in disparate systems (from web logs, external vendors, excel or files, etc.), you can simply load the data through any ETL tools (For example, Talend, Pentaho, SSIS, etc.) or through application of your choice (Perl, Java, Ruby, C#, etc).

For daily load, you need to design incremental loading process during the low user traffic time (nightly loads), pulling only the recent data, de-duping any duplicates, cleansing any non-conforming data, handling erraneous rows, etc.

I have provided a sample GA API application at http://www.hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html and it will provide you basic info to get started.

孤独患者 2024-09-09 10:50:47

在数据仓库中加载数据有两个重要规则:

  1. 初始加载
  2. 增量数据加载

当您使用 GA api 进行设计时,您需要加载某个日期范围内的初始历史数据。这有其自身的复杂性,因为您可能会遇到分段问题、数据丢失等问题。您需要处理分页等。

初始数据加载完成后,您可以在增量模式下运行它,只带来新数据。该数据被附加到相同的数据仓库表中,并且不会导致日期重叠的重复。

除此之外,GA 经常更改他们的 API,因此您也需要掌握这一点。

考虑到上述情况,我们发布了一个完全打包的包含 Google Analytics 和 Salesforce 数据的数据仓库连接器。您可以查看详细信息并了解如何设置自己的数据仓库 http:// www.infocaptor.com/google-analytics-datawarehouse

您至少需要设计某种每天或以某种频率运行的后台守护程序。您将需要作业表来监视提取的成功和失败,以便可以从发生错误的位置恢复。

其他一些考虑因素
1. 如果对相同的数据范围运行提取会发生什么
2. 如果作业在某些日期失败怎么办

为 DW 目标表设置主键非常重要。在 MySQL 中,使用带有重复子句的 insert 语句将确保在重新加载数据时不会创建重复记录。

另一件需要设计的事情是你的暂存层。您从 GA 中提取数据并将其转储到 Staging 中。这样,如果加载到 Target 时出现错误,您只需从暂存中重新加载即可。这样您就不会增加 GA API 限制的负担,并且还可以节省带宽。

您可以在此位置查看我们的完整设计
http://www.infocaptor.com/help/social_analytics___datawarehouse.htm

祝一切顺利与您 DW 的努力。

There are two important rules about loading data in data-warehouse

  1. Initial load
  2. Incremental data load

When you design using GA api, you need to load the initial historical data for a certain date range. This has its own complications as you might run into segmentation issues, loss of data etc. You need to handle pagination etc.

Once the initial data load is complete, you then run it in incremental mode where you just bring new data only. This data gets appended to the same Data warehouse tables and does not cause duplicate with overlapping dates.

On top of this GA changes their API frequently so you need to be on top of this as well.

Considering the above, we released a fully packaged data-warehouse with Google Analytics and Salesforce data connectors. You can check out the details and get ideas on how you want to setup your own datawarehouse http://www.infocaptor.com/google-analytics-datawarehouse

The minimum you would need to design is some kind of background daemon that runs everyday or at some frequency. You will need job tables to monitor the success and failure of the extracts so that it can resume from where the error occurred.

Some of the other considerations
1. What happens if you run the extract for the same data range
2. What if a job fails for certain dates

It is important to set your primary keys for your DW target tables.In MySQL, using insert statement with duplicate clause will make sure that there are no duplicate records created in case of reloading of data.

Another thing to design is your staging layer. You extract data from GA and dump into a Staging. This way if there is error loading into Target you can simply reload from staging. This way you are not burdening your GA API limits and save bandwidth as well.

You can see our complete design at this location
http://www.infocaptor.com/help/social_analytics___datawarehouse.htm

All the best with your DW effort.

淡忘如思 2024-09-09 10:50:47

我们可以使用 Pentaho 将数据从 google Analytics 导出到任何其他数据仓库(或)RDBMS(或)Hadoop数据集成工具,pdi Kettle。 Pentaho 还提供免费的 Pentaho Community Edition 供使用。

只需按照此步骤即可转换来自谷歌分析的数据
http://wiki.pentaho.com/display/EAI/Google+Analytics

该工具在用户友好性、成本和性能方面都很好。谢谢。

We can export data from google analytics to any other data warehouse (or) RDBMS (or) Hadoop using Pentaho Data Integration tool, pdi kettle. Pentaho also have free Pentaho Community Edition to use.

Just follow this step to transform data from google analytics
http://wiki.pentaho.com/display/EAI/Google+Analytics

This tool is good by userfriendly, cost and performance wise. Thanks.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文