Can BigTable do OLAP?

Posted 2024-08-04 22:25:04 | 720 words | 10 views | 0 comments

In the past I built web analytics using OLAP cubes running on MySQL.
An OLAP cube, the way I used it, is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. pagename, useragent, ip) and a bunch of values (e.g. how many pageviews, how many visitors).

The queries that you run on a table like this are usually of the form (meta-SQL):

SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' and pagename='Homepage' and browser!='googlebot'
GROUP BY hour

So you get the totals for each hour of the selected day, with the mentioned filters applied.
One snag was that these cubes usually required a full table scan (for various reasons), which put a practical limit on how large (in MiB) you could make them.
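
To make the shape of this query concrete, here is a minimal sketch that runs it against a tiny in-memory SQLite table. The column types, the sample rows and the `hourly_totals` helper are assumptions made up for illustration; they are not the real cube schema, and SQLite only stands in for MySQL here.

```python
import sqlite3

# Hypothetical sample rows: (date, hour, pagename, browser, hits, bytes)
SAMPLE_ROWS = [
    ("20090914", 9, "Homepage", "firefox", 10, 5000),
    ("20090914", 9, "Homepage", "googlebot", 99, 9999),
    ("20090914", 9, "Homepage", "safari", 5, 2500),
    ("20090914", 10, "Homepage", "firefox", 7, 3500),
    ("20090914", 10, "About", "firefox", 3, 1500),
    ("20090913", 9, "Homepage", "firefox", 4, 2000),
]


def hourly_totals(rows, date, pagename, excluded_browser):
    """Load rows into an in-memory cube table and run the hourly roll-up."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE MyCube (date TEXT, hour INTEGER, pagename TEXT, "
        "browser TEXT, hits INTEGER, bytes INTEGER)"
    )
    conn.executemany("INSERT INTO MyCube VALUES (?, ?, ?, ?, ?, ?)", rows)
    # Same filters and grouping as the meta-SQL above; with no index this
    # is answered by scanning every row of the table.
    result = conn.execute(
        "SELECT hour, SUM(hits), SUM(bytes) FROM MyCube "
        "WHERE date = ? AND pagename = ? AND browser != ? "
        "GROUP BY hour ORDER BY hour",
        (date, pagename, excluded_browser),
    ).fetchall()
    conn.close()
    return result


print(hourly_totals(SAMPLE_ROWS, "20090914", "Homepage", "googlebot"))
```

Each returned tuple is (hour, total hits, total bytes) for one hour of the selected day.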

I'm currently learning the ins and outs of Hadoop and the like.

Running the above query as a MapReduce job on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map step and reduce by summing the values.
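
The map and reduce steps described above can be sketched in plain Python as follows. This is only an in-process simulation under stated assumptions: the row layout, the sample data and the names `map_row`, `reduce_hour` and `run_job` are hypothetical, and no Hadoop or BigTable API is involved.

```python
from collections import defaultdict

# Hypothetical cube rows; a real job would read these from the table.
SAMPLE_ROWS = [
    {"date": "20090914", "hour": 9, "pagename": "Homepage",
     "browser": "firefox", "hits": 10, "bytes": 5000},
    {"date": "20090914", "hour": 9, "pagename": "Homepage",
     "browser": "googlebot", "hits": 99, "bytes": 9999},
    {"date": "20090914", "hour": 9, "pagename": "Homepage",
     "browser": "safari", "hits": 5, "bytes": 2500},
    {"date": "20090914", "hour": 10, "pagename": "Homepage",
     "browser": "firefox", "hits": 7, "bytes": 3500},
    {"date": "20090913", "hour": 9, "pagename": "Homepage",
     "browser": "firefox", "hits": 4, "bytes": 2000},
]


def map_row(row, date, pagename, excluded_browser):
    """Map step: drop rows that fail the filter, emit (hour, (hits, bytes))."""
    if (row["date"] == date and row["pagename"] == pagename
            and row["browser"] != excluded_browser):
        yield row["hour"], (row["hits"], row["bytes"])


def reduce_hour(hour, values):
    """Reduce step: sum all (hits, bytes) pairs emitted for one hour."""
    total_hits = 0
    total_bytes = 0
    for hits, nbytes in values:
        total_hits += hits
        total_bytes += nbytes
    return hour, total_hits, total_bytes


def run_job(rows, date, pagename, excluded_browser):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row, date, pagename, excluded_browser):
            groups[key].append(value)
    return [reduce_hour(hour, groups[hour]) for hour in sorted(groups)]


print(run_job(SAMPLE_ROWS, "20090914", "Homepage", "googlebot"))
```

Note that the map step still touches every input row, so this formulation does not by itself avoid the full scan mentioned earlier.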

Can you run a query like the one I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface, where the user gets their answer ASAP) instead of in batch mode?

If not, what is the appropriate technology for doing something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?
