Can BigTable do OLAP?

Posted 2024-08-04 22:25:04 · 720 words · 6 views · 0 comments

In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
An OLAP cube, the way I used it, is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. pagename, useragent, IP) and a bunch of values (e.g. number of pageviews, number of visitors).

The queries that you run on a table like this are usually of the form (meta-SQL):

SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' and pagename='Homepage' and browser!='googlebot'
GROUP BY hour

So you get the totals for each hour of the selected day, with the filters above applied.
One snag was that these cubes usually meant a full table scan (for various reasons), which put a practical limit on how large (in MiB) you could make them.

I'm currently learning the ins and outs of Hadoop and the like.

Running the above query as a MapReduce job on a BigTable looks easy enough:
simply make 'hour' the key, filter in the map step, and reduce by summing the values.
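As a rough sketch of that job: the code below uses plain Python to stand in for a real MapReduce framework, so the shuffle phase is just a dictionary. The sample rows and the names `ROWS`, `map_row`, `reduce_hour` and `run_query` are made up for illustration; only the column names come from the meta-SQL above.

```python
from collections import defaultdict

# Hypothetical sample rows of the cube (illustrative data only).
ROWS = [
    {"date": "20090914", "hour": 9, "pagename": "Homepage", "browser": "firefox", "hits": 10, "bytes": 2000},
    {"date": "20090914", "hour": 9, "pagename": "Homepage", "browser": "msie", "hits": 5, "bytes": 1000},
    {"date": "20090914", "hour": 9, "pagename": "Homepage", "browser": "googlebot", "hits": 99, "bytes": 9999},
    {"date": "20090914", "hour": 10, "pagename": "Homepage", "browser": "firefox", "hits": 7, "bytes": 700},
    {"date": "20090914", "hour": 10, "pagename": "About", "browser": "firefox", "hits": 3, "bytes": 300},
    {"date": "20090913", "hour": 9, "pagename": "Homepage", "browser": "firefox", "hits": 4, "bytes": 400},
]


def map_row(row):
    """Map step: apply the WHERE filter and emit (hour, (hits, bytes))."""
    if (row["date"] == "20090914"
            and row["pagename"] == "Homepage"
            and row["browser"] != "googlebot"):
        yield row["hour"], (row["hits"], row["bytes"])


def reduce_hour(hour, values):
    """Reduce step: sum hits and bytes for one hour."""
    total_hits = sum(v[0] for v in values)
    total_bytes = sum(v[1] for v in values)
    return hour, total_hits, total_bytes


def run_query(rows):
    """Run map, group by key (the 'shuffle'), then reduce each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            grouped[key].append(value)
    return [reduce_hour(hour, values) for hour, values in sorted(grouped.items())]


if __name__ == "__main__":
    for hour, hits, nbytes in run_query(ROWS):
        print(hour, hits, nbytes)
```

Note that every row still passes through `map_row`, so this is the same full scan as before, just spread over more machines.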

Can you run a query like the one above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface, where the user gets their answer ASAP) instead of in batch mode?

If not, what is the appropriate technology for something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?

Comments (5)

青丝拂面 2024-08-11 22:25:04

It has even been done already (kind of).

Last.fm's aggregation/summary engine: http://github.com/zohmg/zohmg

A Google search turned up a Google Code project called "mroll", but it contains nothing except contact info (no code, nothing). Still, you might want to reach out to the author and see what's up: http://code.google.com/p/mroll/

暮年慕年 2024-08-11 22:25:04

We managed to build low-latency OLAP on HBase by pre-aggregating a SQL query and mapping its results onto appropriate HBase qualifiers. For more detail, see the site below.

http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
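To illustrate the general idea of pre-aggregating into qualifiers, here is a minimal sketch. It is not the scheme from the linked post: the row-key layout (`date|pagename|browser`), the qualifier layout (`HH:hits`, `HH:bytes`), and the class `PreAggregatedTable` are all assumptions, and an in-memory dictionary stands in for an HBase table.

```python
from collections import defaultdict


class PreAggregatedTable:
    """In-memory stand-in for a table of counters (hypothetical layout)."""

    def __init__(self):
        # row key -> {qualifier -> running total}
        self.rows = defaultdict(lambda: defaultdict(int))

    def record(self, date, hour, pagename, browser, hits, nbytes):
        """Aggregate at write time: bump the counters for this hour."""
        # Assumes dimension values never contain the '|' separator.
        row_key = f"{date}|{pagename}|{browser}"
        self.rows[row_key][f"{hour:02d}:hits"] += hits
        self.rows[row_key][f"{hour:02d}:bytes"] += nbytes

    def hourly_totals(self, date, pagename, exclude_browser):
        """Read only rows sharing the key prefix, then sum per hour."""
        prefix = f"{date}|{pagename}|"
        totals = defaultdict(lambda: [0, 0])  # hour -> [hits, bytes]
        for row_key in sorted(self.rows):
            if not row_key.startswith(prefix):
                continue
            if row_key[len(prefix):] == exclude_browser:
                continue
            for qualifier, value in self.rows[row_key].items():
                hour, metric = qualifier.split(":")
                index = 0 if metric == "hits" else 1
                totals[int(hour)][index] += value
        return [(hour, t[0], t[1]) for hour, t in sorted(totals.items())]


table = PreAggregatedTable()
table.record("20090914", 9, "Homepage", "firefox", 10, 2000)
table.record("20090914", 9, "Homepage", "msie", 5, 1000)
table.record("20090914", 9, "Homepage", "googlebot", 99, 9999)
table.record("20090914", 10, "Homepage", "firefox", 7, 700)
table.record("20090914", 10, "About", "firefox", 3, 300)
print(table.hourly_totals("20090914", "Homepage", "googlebot"))
```

In a real store the prefix filter would be a range scan over sorted row keys instead of a loop over every key, so the query touches a handful of pre-summed rows rather than every raw measurement. The cost is that the key layout has to be chosen with the expected queries in mind.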

陌上青苔 2024-08-11 22:25:04

My answer relates to HBase, but applies equally to BigTable.

Urban Airship open-sourced datacube, which I think is close to what you want. They have also given a presentation about it.

Adobe also has a couple of presentations on how they do "low-latency OLAP" with HBase.

浅唱々樱花落 2024-08-11 22:25:04

Andrei Dragomir gave an interesting talk on how Adobe implements OLAP functionality with M/R and HBase.

Video: http://www.youtube.com/watch?v=5U3EnfiKs44

Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/

倒带 2024-08-11 22:25:04

If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery scales out automatically on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.

http://www.youtube.com/watch?v=QI8623HlYd4

It's not MapReduce, but it is geared towards high-speed table scans like the one you described.
