Recommendations for a large-scale data warehousing system
I have a large amount of data that I need to store and generate reports on. Each record represents an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this. Obviously it needs to be reliable, and it should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience with such software and can make a recommendation and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
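To put the "over 50 per second" figure in perspective, here is a rough back-of-the-envelope sketch. Only the 50 events per second comes from the question; the 200 bytes per stored event is a purely hypothetical size chosen for illustration, so substitute your own measurement.

```python
# Rough volume estimate for a steady event stream.
EVENTS_PER_SECOND = 50          # from the question ("over 50 per second")
ASSUMED_BYTES_PER_EVENT = 200   # hypothetical row size, not from the question

SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(rate_per_second):
    """Number of events produced in one day at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY


def raw_bytes_per_day(rate_per_second, bytes_per_event):
    """Raw storage needed per day, ignoring indexes and overhead."""
    return events_per_day(rate_per_second) * bytes_per_event


if __name__ == "__main__":
    print(events_per_day(EVENTS_PER_SECOND))
    print(raw_bytes_per_day(EVENTS_PER_SECOND, ASSUMED_BYTES_PER_EVENT))
```

At a constant 50 events per second that is 4,320,000 rows per day, which is why rolling older data up into aggregates matters more than the raw byte count.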
Comments (4)
Wow. You are opening up a huge topic.
A few things right off the top of my head...
As I say, it's a huge topic. As I think of more, I'll continue adding to my list.
HTH and good luck
@Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
I'm surprised none of the answers here cover Hadoop and HDFS. I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you would use HDFS (the Hadoop Distributed File System, which can run on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run MapReduce queries against your data to produce reports.
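To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop or HDFS at all; it only mimics the map, shuffle, and reduce phases that a real cluster would distribute across machines. The event fields (`page`, `timestamp`) and all function names are assumptions made up for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_event(event):
    """Map step: emit one ((page, hour), 1) pair per event."""
    hour = event["timestamp"][:13]  # keeps "YYYY-MM-DDTHH" of an ISO timestamp
    yield (event["page"], hour), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(events):
    """Run map, shuffle, and reduce in sequence over an iterable of events."""
    mapped = chain.from_iterable(map_event(e) for e in events)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    sample = [
        {"page": "/home", "timestamp": "2024-01-01T10:05:00"},
        {"page": "/home", "timestamp": "2024-01-01T10:40:00"},
        {"page": "/cart", "timestamp": "2024-01-01T11:10:00"},
    ]
    print(run_job(sample))
```

On a real cluster the map and reduce functions stay this small; the framework handles splitting the input, moving the grouped keys between nodes, and retrying failed tasks.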
Wow, this is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definite difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
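A minimal sketch of that roll-up, using Python's built-in `sqlite3` as a stand-in for whatever database you choose. The table names (`raw_events`, `hourly_report`), the column names, and the hourly granularity are all assumptions for illustration. It also assumes timestamps are stored as `YYYY-MM-DD HH:MM:SS` text and that the cutoff falls on an hour boundary, so no hour is rolled up twice.

```python
import sqlite3


def create_schema(conn):
    """Create a transactional table and a separate reporting table."""
    conn.execute(
        "CREATE TABLE raw_events (occurred_at TEXT NOT NULL, event_type TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE hourly_report ("
        "hour TEXT NOT NULL, event_type TEXT NOT NULL, event_count INTEGER NOT NULL)"
    )


def roll_up(conn, cutoff):
    """Aggregate raw events older than `cutoff` into hourly counts, then purge them."""
    with conn:  # single transaction: aggregate and delete succeed or fail together
        conn.execute(
            """
            INSERT INTO hourly_report (hour, event_type, event_count)
            SELECT strftime('%Y-%m-%d %H:00', occurred_at), event_type, COUNT(*)
            FROM raw_events
            WHERE occurred_at < ?
            GROUP BY strftime('%Y-%m-%d %H:00', occurred_at), event_type
            """,
            (cutoff,),
        )
        conn.execute("DELETE FROM raw_events WHERE occurred_at < ?", (cutoff,))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.executemany(
        "INSERT INTO raw_events (occurred_at, event_type) VALUES (?, ?)",
        [("2024-01-01 10:05:00", "click"), ("2024-01-01 12:30:00", "click")],
    )
    roll_up(conn, "2024-01-01 12:00:00")
    print(conn.execute("SELECT * FROM hourly_report").fetchall())
```

Running `roll_up` from a scheduled job keeps the transactional table small while reports query only the much smaller aggregate table.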
I believe you can approach this in two ways:
Throw money at the problem: Buy best-in-class software (databases, reporting software) and hire a few slick tech people to help.
Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes, I'm not sure how it would fit into a data storage strategy. The processing involved here is limited, and processing is where EC2 is strong. Your primary goal is efficient storage and retrieval.