Can OLAP be done in BigTable?
In the past I built web analytics using OLAP cubes running on MySQL.
An OLAP cube, the way I used it, is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. pagename, useragent, ip) and a bunch of values (e.g. how many pageviews, how many visitors).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' and pagename='Homepage' and browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually required a full table scan (for various reasons), which put a practical limit on the size (in MiB) they could grow to.
I'm currently learning the ins and outs of Hadoop and the like.
Running the above query as a mapreduce on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map and reduce by summing the values.
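To make that mapping concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce steps for the query above. It does not use Hadoop or any BigTable client; the sample rows, the function names (`map_row`, `reduce_hour`, `run_job`) and the in-memory shuffle are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical cube rows; the field names follow the meta-SQL above.
ROWS = [
    {"date": "20090914", "hour": 0, "pagename": "Homepage", "browser": "firefox", "hits": 10, "bytes": 2000},
    {"date": "20090914", "hour": 0, "pagename": "Homepage", "browser": "googlebot", "hits": 5, "bytes": 900},
    {"date": "20090914", "hour": 0, "pagename": "Homepage", "browser": "safari", "hits": 3, "bytes": 700},
    {"date": "20090914", "hour": 1, "pagename": "Homepage", "browser": "firefox", "hits": 7, "bytes": 1500},
    {"date": "20090914", "hour": 1, "pagename": "About", "browser": "firefox", "hits": 4, "bytes": 800},
    {"date": "20090913", "hour": 1, "pagename": "Homepage", "browser": "firefox", "hits": 9, "bytes": 1800},
]


def map_row(row):
    """Map step: apply the WHERE filter, then emit (hour, (hits, bytes))."""
    if (
        row["date"] == "20090914"
        and row["pagename"] == "Homepage"
        and row["browser"] != "googlebot"
    ):
        yield row["hour"], (row["hits"], row["bytes"])


def reduce_hour(hour, values):
    """Reduce step: sum hits and bytes for one hour."""
    total_hits = sum(hits for hits, _ in values)
    total_bytes = sum(nbytes for _, nbytes in values)
    return hour, total_hits, total_bytes


def run_job(rows):
    """Run map, an in-memory shuffle (group by key), then reduce."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            grouped[key].append(value)
    return [reduce_hour(hour, values) for hour, values in sorted(grouped.items())]


if __name__ == "__main__":
    for hour, hits, nbytes in run_job(ROWS):
        print(hour, hits, nbytes)
```

Note that every row still passes through `map_row`, so this is the same full scan as before, only spread across workers.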
Can you run a query like the one I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface, where the user gets their answer ASAP) instead of in batch mode?
If not, what is the appropriate technology for something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?
Comments (5)
It has even been done already (kind of).
LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg
A Google search turned up a Google Code project, "mroll", but it contains nothing except contact info (no code, nothing). Still, you might want to reach out to its author and see what's up. http://code.google.com/p/mroll/
We managed to build low-latency OLAP in HBase by pre-aggregating a SQL query and mapping the results onto appropriate HBase column qualifiers. For more details, see the post below.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
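The general idea of pre-aggregation can be sketched as follows. This is not the schema from the linked post: the row-key layout (`date|pagename`), the qualifier names (`hits:HH`, `bytes:HH`) and the plain dict standing in for an HBase table are all assumptions for illustration, and no HBase client is used.

```python
from collections import defaultdict

# Hypothetical raw events; the field names follow the question's meta-SQL.
EVENTS = [
    {"date": "20090914", "hour": 0, "pagename": "Homepage", "browser": "firefox", "hits": 10, "bytes": 2000},
    {"date": "20090914", "hour": 0, "pagename": "Homepage", "browser": "googlebot", "hits": 5, "bytes": 900},
    {"date": "20090914", "hour": 0, "pagename": "Homepage", "browser": "safari", "hits": 3, "bytes": 700},
    {"date": "20090914", "hour": 1, "pagename": "Homepage", "browser": "firefox", "hits": 7, "bytes": 1500},
    {"date": "20090914", "hour": 1, "pagename": "About", "browser": "firefox", "hits": 4, "bytes": 800},
]


def pre_aggregate(events):
    """Write path: fold events into counters keyed by row key and qualifier.

    The nested dict stands in for an HBase table:
    row key -> {column qualifier -> counter}.
    """
    table = defaultdict(lambda: defaultdict(int))
    for event in events:
        if event["browser"] == "googlebot":
            continue  # assumption: this filter is known in advance and applied at write time
        row_key = f"{event['date']}|{event['pagename']}"
        table[row_key][f"hits:{event['hour']:02d}"] += event["hits"]
        table[row_key][f"bytes:{event['hour']:02d}"] += event["bytes"]
    return table


def hourly_totals(table, date, pagename):
    """Read path: one row lookup instead of a table scan."""
    row = table.get(f"{date}|{pagename}", {})
    hours = sorted({qualifier.split(":")[1] for qualifier in row})
    return [(int(h), row.get(f"hits:{h}", 0), row.get(f"bytes:{h}", 0)) for h in hours]


if __name__ == "__main__":
    cube = pre_aggregate(EVENTS)
    print(hourly_totals(cube, "20090914", "Homepage"))
```

The trade-off in this sketch is that the query shape has to be known when the data is written: a filter that was not anticipated (say, excluding a different browser) would need the browser to be part of the row key or qualifier, or a fresh aggregation pass.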
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir gave an interesting talk about how Adobe implements OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery scales out automatically on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce, but it is geared towards high-speed table scans like the one you described.