Recommendations for a large-scale data warehouse system

Posted 2024-07-07 12:11:12

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).

I'm evaluating approaches to implementing this; obviously it needs to be reliable and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.

I'm hoping that some SOers have experience with such software and can make a recommendation, and/or point out the pitfalls.

Ideally I'd like to deploy this on EC2.

Comments (4)

三生池水覆流年 2024-07-14 12:11:12

Wow. You are opening up a huge topic.

A few things right off the top of my head...

  1. think carefully about your schema for inserts in the transactional part and reads in the reporting part; you may be best off keeping them separate if you have really large data volumes
  2. look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions (a minimal sketch of such a rollup job follows this list).
  3. look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
  4. prototype with some meaningful queries and some realistic data volumes
  5. get yourself a real production-quality, enterprise-ready database, i.e. Oracle / MSSQL
  6. think about using someone else's code/product for the reporting, e.g. Crystal / BO / Cognos
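
To make items 1 and 2 concrete, here is a minimal sketch of such a rollup job in Python. It uses the standard-library sqlite3 module purely as a stand-in for whatever production DBMS you end up on; the `events` / `events_hourly` tables and the `occurred_at` / `event_type` columns are illustrative assumptions, not anything from the question.

```python
import sqlite3

# Periodic rollup: fold raw events older than a cutoff into an hourly
# aggregate table that the reporting side reads from.
def rollup(conn: sqlite3.Connection, cutoff: str) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events_hourly (
            hour        TEXT    NOT NULL,
            event_type  TEXT    NOT NULL,
            event_count INTEGER NOT NULL,
            PRIMARY KEY (hour, event_type)
        )
    """)
    # One aggregate row per (hour, event type); merge into any existing row
    # for the same hour/type via the upsert clause.
    conn.execute("""
        INSERT INTO events_hourly (hour, event_type, event_count)
        SELECT strftime('%Y-%m-%d %H:00', occurred_at), event_type, COUNT(*)
        FROM events
        WHERE occurred_at < ?
        GROUP BY 1, 2
        ON CONFLICT (hour, event_type)
        DO UPDATE SET event_count = event_count + excluded.event_count
    """, (cutoff,))
    # Drop the raw rows that have now been rolled up.
    conn.execute("DELETE FROM events WHERE occurred_at < ?", (cutoff,))
    conn.commit()
```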

As I say, huge topic. As I think of more I'll continue adding to my list.

HTH and good luck

2024-07-14 12:11:12

@Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:

  1. Use the right data type for the timestamps - make sure the DBMS has the appropriate precision.
  2. Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events (see the sketch after this list).
  3. Separate the schemas for your transactional database and your data warehouse.
  4. Seriously consider a periodic ETL from the transactional db to the data warehouse.
  5. Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions.
  6. Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
  7. Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
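
To illustrate item 2, here is a minimal Python sketch (standard library only) of queued event capture with a small pool of storage workers. The queue bounds, batch size, and the store_batch() hook are assumptions for illustration; in production the queue would more likely be an external broker and store_batch() a bulk insert into the transactional store.

```python
import queue
import threading

events: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)
BATCH_SIZE = 200

def store_batch(batch: list) -> None:
    # Placeholder for a bulk insert into the transactional database.
    print(f"stored {len(batch)} events")

def worker() -> None:
    batch = []
    while True:
        try:
            batch.append(events.get(timeout=1.0))
        except queue.Empty:
            pass
        # Flush on a full batch, or whenever the queue goes quiet.
        if batch and (len(batch) >= BATCH_SIZE or events.empty()):
            store_batch(batch)
            batch = []

def capture(event: dict) -> None:
    # Called from the web tier; keeps the request path from blocking on storage.
    events.put(event, timeout=0.1)

for _ in range(4):  # a few storage workers
    threading.Thread(target=worker, daemon=True).start()
```
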
高冷爸爸 2024-07-14 12:11:12

I'm surprised none of the answers here cover Hadoop and HDFS - I would suggest that is because SO is a programmers' Q&A and your question is in fact a data science question.

If you're dealing with a large number of queries and long processing times, you would use HDFS (a distributed storage format on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.

You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run MapReduce queries against your data to produce reports.
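
As a sketch of what such a MapReduce report job can look like, here is a Hadoop Streaming-style mapper and reducer in Python that counts events per hour and event type. The tab-separated log layout and the report.py file name are assumptions for illustration; Hadoop Streaming itself just pipes lines through stdin/stdout and sorts the mapper output by key between the two phases.

```python
#!/usr/bin/env python3
"""Count events per (hour, event type). Run as `report.py mapper` or `report.py reducer`."""
import sys

def mapper() -> None:
    # Assumed input: "<ISO timestamp>\t<event type>\t..." per line.
    for line in sys.stdin:
        timestamp, event_type, *_ = line.rstrip("\n").split("\t")
        hour = timestamp[:13]                 # e.g. "2009-06-01T14"
        print(f"{hour} {event_type}\t1")      # key before the tab, count after

def reducer() -> None:
    # Input arrives sorted by key, so equal keys are adjacent.
    current_key, count = None, 0
    for line in sys.stdin:
        key, n = line.rstrip("\n").split("\t")
        if current_key is not None and key != current_key:
            print(f"{current_key}\t{count}")
            count = 0
        current_key = key
        count += int(n)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "mapper" else reducer()
```

You can smoke-test the same logic locally with `cat events.log | ./report.py mapper | sort | ./report.py reducer` before handing it to the cluster.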

七婞 2024-07-14 12:11:12

Wow.. This is a huge topic.

Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.

Second, there is a definite difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
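
Below is a minimal sketch of that "regular schedule" in Python, assuming a simple in-process timer; in practice this is just as often a cron job or an ETL tool, and roll_up_into_reporting_schema() is a hypothetical placeholder for the actual transactional-to-reporting aggregation (the rollup sketch in the first answer shows what that step might contain).

```python
import threading
import time

ROLLUP_INTERVAL_SECONDS = 3600  # roll up once an hour

def roll_up_into_reporting_schema() -> None:
    # Placeholder: aggregate recent transactional rows into the reporting schema.
    print("rolling transactional data up into the reporting area...")

def run_rollup_forever() -> None:
    while True:
        started = time.monotonic()
        try:
            roll_up_into_reporting_schema()
        except Exception as exc:  # keep the schedule alive if one run fails
            print(f"rollup failed: {exc}")
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, ROLLUP_INTERVAL_SECONDS - elapsed))

threading.Thread(target=run_rollup_forever, daemon=True).start()
```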

I believe you can approach this in two ways:

  • Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help

  • Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.

As far as the EC2 approach goes, I'm not sure how this would fit into a data storage strategy. The processing requirement is limited, and processing is where EC2 is strong; your primary goal is efficient storage and retrieval.
