Database choice for a web-scale analytics application
I want to build a web application similar to Google Analytics, in which I collect statistics on my customers' end-users and show my customers analyses based on that data.
Characteristics:
- High scalability, handle very large volume
- Compartmentalized - Queries always run on a single customer's data
- Support analytical queries (drill-down, slices, etc.)
Due to the analytical needs, I'm considering using an OLAP/BI suite, but I'm not sure it's meant for this scale. A NoSQL database? Would a simple RDBMS do?
3 Answers
This is what I am using at work in a production environment, and it works like a charm.
I coupled three things:
PostgreSQL + LucidDB + Mondrian (more generally, the whole set of Pentaho BI suite components)
PostgreSQL: I am not going to describe PostgreSQL at length; it is a really strong open-source RDBMS that will certainly let you do everything you need. I use it to store my operational data.
LucidDB: LucidDB is an open-source column-store database. It is highly scalable and will give you a real gain in processing time compared to PostgreSQL when retrieving large amounts of data. It is not optimized for transaction processing but for intensive reads. This is my data-warehouse database.
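To make the hop from the operational store to the warehouse concrete, here is a minimal sketch of a batch job that pulls one customer's events out of PostgreSQL and stages them as a flat file for bulk loading into the column store; the connection string, table, and column names are hypothetical, and the final load step would use the warehouse's own bulk loader.

```python
# Minimal sketch (hypothetical schema and connection details): extract a customer's
# events from the operational PostgreSQL database and stage them as CSV, so the
# column-store warehouse can ingest them with its own bulk loader.
import csv

import psycopg2  # PostgreSQL driver


def stage_events_for_warehouse(customer_id, since, out_path):
    conn = psycopg2.connect("dbname=operational user=etl")
    try:
        # A named (server-side) cursor streams rows instead of holding them all in RAM.
        with conn.cursor(name="etl_events") as cur:
            cur.execute(
                """
                SELECT customer_id, user_id, event_type, page_url, occurred_at
                FROM events
                WHERE customer_id = %s AND occurred_at >= %s
                ORDER BY occurred_at
                """,
                (customer_id, since),
            )
            with open(out_path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["customer_id", "user_id", "event_type",
                                 "page_url", "occurred_at"])
                for row in cur:
                    writer.writerow(row)
    finally:
        conn.close()

# The resulting CSV is then bulk-loaded into the warehouse in one shot
# (e.g. via LucidDB's flat-file reader or any COPY-style loader).
```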
Mondrian: Mondrian is an open-source R-OLAP cube engine. LucidDB made it easy to connect those two programs together.
I would recommend you look at the whole Pentaho BI Suite; it is worth it, and you might want to use some of its components.
Hope I could help.
There are two main architectures you could opt for to get true web scale:
1. "BI" architecture
2. "NoSQL" architecture
The immutable event store or journaller is there because in most cases you want to batch your analytics events and do bulk updates to your database (even with something like HDFS), rather than doing an atomic write for every single page view and so on.
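To make that batching idea concrete, here is a minimal sketch of an in-process buffer (the class name, thresholds, and sink callable are all made up) that accumulates tracked events and hands them off in bulk once a size or age threshold is hit, instead of issuing one write per page view.

```python
# Minimal sketch: buffer tracked events in memory and flush them in bulk once a
# size or age threshold is reached, rather than writing each page view atomically.
# The sink callable stands in for an S3/HDFS upload or a bulk database insert.
import threading
import time


class EventBuffer:
    def __init__(self, sink, flush_size=10_000, flush_interval_s=60):
        self._sink = sink
        self._flush_size = flush_size
        self._flush_interval_s = flush_interval_s
        self._events = []
        self._last_flush = time.monotonic()
        self._lock = threading.Lock()

    def record(self, event):
        """Called once per tracked event (page view, click, ...)."""
        with self._lock:
            self._events.append(event)
            too_big = len(self._events) >= self._flush_size
            too_old = time.monotonic() - self._last_flush >= self._flush_interval_s
            if not (too_big or too_old):
                return
            batch, self._events = self._events, []
            self._last_flush = time.monotonic()
        self._sink(batch)  # one bulk write instead of thousands of tiny ones


# Usage: buffer = EventBuffer(sink=lambda batch: print(f"flushing {len(batch)} events"))
```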
For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.
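This is not SnowPlow's actual loader, just a hedged sketch of the same pattern: list the raw log objects the collectors wrote to S3 for a given day, then point a partition of an external Hive table at that prefix so batch queries can run over it (bucket, prefix, and table names are invented).

```python
# Hedged sketch (not SnowPlow's real pipeline): find a day's raw event logs on S3
# and register them as a partition of an external Hive table for batch querying.
# Bucket, prefix, and table names are invented for illustration.
import boto3


def list_event_logs(bucket="example-event-logs", day="2012-09-01"):
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=f"events/{day}/"
    ):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# HiveQL the batch job would then run, making the day's logs queryable in Hive:
HIVE_ADD_PARTITION = """
ALTER TABLE raw_events
ADD IF NOT EXISTS PARTITION (dt='2012-09-01')
LOCATION 's3://example-event-logs/events/2012-09-01/';
"""
```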
Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
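Because queries are compartmentalized per customer, routing each customer to a single shard is straightforward; here is a minimal sketch (shard endpoints are placeholders, and a real deployment would more likely use a lookup table or consistent hashing so shards can be added without reshuffling everyone).

```python
# Minimal sketch of sharding by customer: every read and write for a customer goes
# to one shard, which works because the analytics never cross customers.
# Endpoints are placeholders; a lookup table or consistent hashing would make
# adding shards less disruptive than plain modulo hashing.
import hashlib

SHARDS = [
    "analytics-db-0.internal:5432",
    "analytics-db-1.internal:5432",
    "analytics-db-2.internal:5432",
]


def shard_for_customer(customer_id: str) -> str:
    digest = hashlib.sha1(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


# shard_for_customer("acme-corp") always yields the same endpoint, so one customer's
# drill-down queries never need to fan out across shards.
```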
I'd say that putting OLAP analysis in place is always nice, and it then offers great potential for sophisticated data analysis using MDX.
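As a flavour of what MDX buys you, here is a hedged example of a drill-down style query; the cube, dimension, and measure names ([WebEvents], [Page Views], and so on) are invented, and in practice the query would be sent to the OLAP server through an XMLA or olap4j client.

```python
# Hedged illustration only: an MDX query of the kind an OLAP front end might issue.
# Cube, dimension, and measure names are invented; in practice this string would be
# submitted to the OLAP server over an XMLA/olap4j connection.
MDX_DRILL_DOWN = """
SELECT
  {[Measures].[Page Views], [Measures].[Unique Visitors]} ON COLUMNS,
  NON EMPTY [Date].[2012].Children ON ROWS
FROM [WebEvents]
WHERE ([Customer].[Acme Corp])
"""

print(MDX_DRILL_DOWN)  # drill deeper by descending the [Date] hierarchy level by level
```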
Cheers.
Disclaimer: I'll make some publicity for my own solution here - have a look at www.icCube.com and contact me for more details.