Choosing a DB and OLAP solution for fast web analytics (large data sets)

Posted on 2024-11-28 03:30:06

I have the following problem: my system collects about 300M hits per day from different sites.
Each hit has a time, user id, type (ad or usual), http address, and site id.

There is also a set of about 200M users, each with a gender, age group, and country.

I need to design a system that uses the hit data to produce real-time reports on hits by different user groups, much like an OLAP solution :-)

For example, plot a graph of hits by girls aged 15-25 from the UK, from September to October 2011.

Which database would you recommend, and which solution for building OLAP cubes?

I'm looking for an open-source solution, like HBase (+ zohmg or cascading), Hypertable, or something else (a free DWH :-)).

Comments (2)

两个我 2024-12-05 03:30:06

That's a massive amount of data; 300M daily hits is a volume that only a few sectors ever see. That is around 110 billion (300 × 10^6 × 365 ≈ 1.1 × 10^11) hits per year, wow!

I don't know of any OLAP-like solution that can handle that amount of data without pre-aggregation, even counting the paid ones. Maybe you could work on reducing the information held in your analytical (aka OLAP) system: dropping, or reducing the size of, the dimensions user_id, http address, and site id.

I'd be curious how you solve your problem, and how you got the gender and age group of your 200M users :-).
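
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. Only the 300M hits per day comes from the question; the dimension cardinalities (3 gender values, 8 age groups, 250 countries, 2 hit types) are assumptions chosen purely for illustration, and the names are hypothetical.

```python
# Back-of-the-envelope sizing. Only HITS_PER_DAY comes from the question;
# every cardinality below is an assumed value for illustration.
HITS_PER_DAY = 300_000_000
DAYS_PER_YEAR = 365

raw_hits_per_year = HITS_PER_DAY * DAYS_PER_YEAR

# Assumed sizes of the low-cardinality dimensions that remain once
# user_id and http address are dropped.
cardinalities = {"gender": 3, "age_group": 8, "country": 250, "hit_type": 2}


def max_cells_per_day(cards):
    """Upper bound on distinct dimension combinations for a single day."""
    cells = 1
    for size in cards.values():
        cells *= size
    return cells


cells_per_day = max_cells_per_day(cardinalities)
cells_per_year = cells_per_day * DAYS_PER_YEAR

print(f"raw hits per year:         {raw_hits_per_year:,}")
print(f"max daily aggregate rows:  {cells_per_day:,}")
print(f"max yearly aggregate rows: {cells_per_year:,}")
print(f"reduction factor:          {raw_hits_per_year / cells_per_year:,.0f}x")
```

Keeping site id as a dimension would multiply the aggregate row counts by the number of sites, which is why its size matters too.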

一杆小烟枪 2024-12-05 03:30:06

You might be able to do this in (close to) real time. You can pre-aggregate over each day. Imagine the following materialized view:

-- myTable stands for the hits joined with their users' attributes
CREATE VIEW preAggregatedByDate AS
SELECT gender, ageGroup, country, date, COUNT(*) AS dailyCount
FROM myTable
GROUP BY gender, ageGroup, country, date;

So you've got a count for each combination of your parameters (gender, age group, and country) over each date. For your query, you'd do something like:

-- add up the daily counts; COUNT(*) would only count the aggregate rows
SELECT SUM(dailyCount)
FROM preAggregatedByDate
WHERE gender = 'Female' AND ageGroup = '15_25' AND country = 'UK' AND
    date >= '20110901' AND date < '20111101';

Obviously, it's not going to be this simple (and you might not get to use SQL and materialized views), but this might be the best way to imagine it if you are familiar with conventional databases. Your actual query is only pulling data from a greatly reduced, pre-aggregated data source.
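
If it helps to see the idea run end to end, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (users, hits, daily_counts, hit_date and so on) and the sample rows are made up for illustration; a plain table built with CREATE TABLE ... AS SELECT stands in for the materialized view, and nothing here is meant as a recommendation for the real data volume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY,
                        gender TEXT, age_group TEXT, country TEXT);
    CREATE TABLE hits  (hit_date TEXT, user_id INTEGER,
                        hit_type TEXT, site_id INTEGER);
""")
# Made-up sample data.
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    (1, "Female", "15_25", "UK"),
    (2, "Male", "26_35", "UK"),
    (3, "Female", "15_25", "UK"),
])
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", [
    ("20110915", 1, "usual", 10),
    ("20110915", 3, "ad", 10),
    ("20111002", 1, "usual", 11),
    ("20111002", 2, "usual", 11),
    ("20111105", 3, "usual", 10),  # outside the queried date range
])

# Daily rollup: one row per (date, gender, age group, country).
conn.execute("""
    CREATE TABLE daily_counts AS
    SELECT h.hit_date, u.gender, u.age_group, u.country,
           COUNT(*) AS daily_count
    FROM hits h JOIN users u ON u.user_id = h.user_id
    GROUP BY h.hit_date, u.gender, u.age_group, u.country
""")


def hits_in_range(conn, gender, age_group, country, start, end):
    """Sum the pre-aggregated daily counts for start <= hit_date < end."""
    row = conn.execute("""
        SELECT COALESCE(SUM(daily_count), 0) FROM daily_counts
        WHERE gender = ? AND age_group = ? AND country = ?
          AND hit_date >= ? AND hit_date < ?
    """, (gender, age_group, country, start, end)).fetchone()
    return row[0]


total = hits_in_range(conn, "Female", "15_25", "UK", "20110901", "20111101")
print(total)
```

The report query only ever touches daily_counts, so its cost depends on the number of dimension combinations per day rather than on the number of raw hits.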

If you don't mind having to wait a few hours after the end of the day to run queries over that day, the easiest solution may be to use Hadoop to compute the pre-aggregated data and put that into some normal database that you can then run queries on.
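
The shape of such a batch job can be sketched in plain Python. This is not Hadoop API code, just an imitation of the map and reduce steps, with hypothetical names and made-up sample data. The user lookup is an in-memory dict here, an assumption that would need revisiting with 200M users.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the user table and one day's hit log.
users = {
    1: ("Female", "15_25", "UK"),
    2: ("Male", "26_35", "DE"),
}
hits = [
    # (date, user_id, hit_type, site_id)
    ("20110915", 1, "usual", 10),
    ("20110915", 1, "ad", 11),
    ("20110915", 2, "usual", 10),
    ("20110915", 99, "usual", 10),  # unknown user, skipped by the map step
]


def map_hit(hit, users):
    """Map step: join a hit to its user and emit (key, 1)."""
    date, user_id, _hit_type, _site_id = hit
    profile = users.get(user_id)
    if profile is None:
        return  # no demographic data for this user, emit nothing
    yield (date, *profile), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def pairs_for(hits, users):
    """Run the map step over every hit and chain the emitted pairs."""
    for hit in hits:
        yield from map_hit(hit, users)


daily_counts = reduce_counts(pairs_for(hits, users))
for key, count in sorted(daily_counts.items()):
    print(key, count)
```

The resulting (date, gender, age group, country) counts are what would be loaded into the normal database for querying.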
