Database choice for large volumes of data?

I'm about to start a new project which should have a rather large database.

The number of tables will not be large (<15); the majority of the data (99%) will be contained in one big table, which is almost insert/read-only (no updates).

The estimated amount of data in that one table is going to grow at 500,000 records a day, and we should keep at least 1 year of them to be able to do various reports.

There needs to be a (read-only) replicated database as a backup/failover, and maybe for offloading reports at peak times.

I don't have first-hand experience with databases that large, so I'm asking those who do: which DB is the best choice in this situation? I know that Oracle is the safe bet, but I'm more interested in whether anyone has experience with PostgreSQL or MySQL in a similar setup.

2024-07-21 23:44:25

I've used PostgreSQL in an environment where we're seeing 100K-2M new rows per day, most added to a single table. However, those rows tend to be reduced to samples and then deleted within a few days, so I can't speak about long-term performance with more than ~100M rows.

I've found that insert performance is quite reasonable, especially if you use the bulk COPY. Query performance is fine, although the choices the planner makes sometimes puzzle me; particularly when doing JOINs / EXISTS. Our database requires pretty regular maintenance (VACUUM/ANALYZE) to keep it running smoothly. I could avoid some of this by more carefully optimizing autovacuum and other settings, and it's not so much of an issue if you're not doing many DELETEs. Overall, there are some areas where I feel it's more difficult to configure and maintain than it should be.
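To make the bulk-COPY point concrete, here is a minimal Python sketch using psycopg2; the table name, columns, and connection string are invented for the example, not taken from this answer.

    # Bulk-load a batch of rows with COPY instead of row-by-row INSERTs.
    import io
    import psycopg2

    conn = psycopg2.connect("dbname=metrics user=app")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # Stage a day's worth of rows as CSV in memory; in practice this
        # would stream from a file or an upstream feed.
        buf = io.StringIO()
        for i in range(500_000):
            buf.write(f"{i},2009-06-18 12:00:00,42.0\n")
        buf.seek(0)
        # COPY streams the whole batch to the server in one pass.
        cur.copy_expert(
            "COPY events (id, recorded_at, value) FROM STDIN WITH (FORMAT csv)",
            buf,
        )
    conn.close()

At 500,000 rows a day, COPY is typically far faster than individual INSERT statements, which is why it is worth the extra plumbing.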

I have not used Oracle, and have used MySQL only for small datasets, so I can't compare performance. But PostgreSQL does work fine for large datasets.

无力看清 2024-07-21 23:44:25

Do you have a copy of "The Data Warehouse Toolkit"?

The suggestion there is to do the following.

  1. Separate fact (measurable, numeric) values from the dimensions that qualify or organize those facts. One big table isn't really the best idea; what dominates the design is a fact table, plus a number of small dimension tables that allow "slicing and dicing" the facts (a toy sketch follows this list).

  2. Keep the facts in simple flat files until you want to do SQL-style reporting. Don't create and back up a database. Create and back up files; load a database only for the reports you must do from SQL.

  3. Where possible, create summaries or extra data marts for analysis. In some cases, you may need to load the whole thing into a database. If your files reflect your table design, all databases have bulk-loader tools that can populate and index SQL tables from the files.
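As a toy illustration of the fact/dimension split in point 1, here is a small Python sketch; SQLite is used only because it ships with the standard library, and the schema and names are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date   (date_id INTEGER PRIMARY KEY, day TEXT);
        CREATE TABLE dim_source (source_id INTEGER PRIMARY KEY, name TEXT);
        -- The one big fact table holds only numeric measures plus dimension keys.
        CREATE TABLE fact_readings (
            date_id   INTEGER REFERENCES dim_date(date_id),
            source_id INTEGER REFERENCES dim_source(source_id),
            value     REAL
        );
    """)
    conn.execute("INSERT INTO dim_date VALUES (1, '2009-06-18')")
    conn.execute("INSERT INTO dim_source VALUES (1, 'sensor-a')")
    conn.execute("INSERT INTO fact_readings VALUES (1, 1, 42.0)")

    # "Slicing and dicing": aggregate the facts through the dimensions.
    for row in conn.execute("""
        SELECT d.day, s.name, SUM(f.value)
        FROM fact_readings f
        JOIN dim_date d   USING (date_id)
        JOIN dim_source s USING (source_id)
        GROUP BY d.day, s.name
    """):
        print(row)  # ('2009-06-18', 'sensor-a', 42.0)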

戏舞 2024-07-21 23:44:25

The amount of data (about 200 million records per year; 500,000 a day × 365 ≈ 183 million) is not really big and should be manageable with any standard database engine.

The case is even easier if you do not need live reports on it. I'd mirror and pre-aggregate the data on some other server, e.g. in a daily batch. As S.Lott suggested, you might like to read up on data warehousing.
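For concreteness, here is a hedged sketch of such a daily pre-aggregation batch in Python; the connection string, table, and column names are assumptions for illustration, not part of the answer.

    # Roll one day's raw rows up into a summary table on the reporting mirror.
    import psycopg2

    ROLLUP_SQL = """
        INSERT INTO daily_summary (day, source_id, n_rows, total_value)
        SELECT recorded_at::date, source_id, COUNT(*), SUM(value)
        FROM events
        WHERE recorded_at >= %s::date
          AND recorded_at <  %s::date + INTERVAL '1 day'
        GROUP BY recorded_at::date, source_id
    """

    def run_daily_rollup(day: str) -> None:
        conn = psycopg2.connect("dbname=reports user=batch")  # hypothetical DSN
        with conn, conn.cursor() as cur:
            cur.execute(ROLLUP_SQL, (day, day))
        conn.close()

    run_daily_rollup("2009-06-17")  # typically driven by cron once a day

Reports then query the small daily_summary table instead of scanning the roughly 183-million-row raw table.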

残疾 2024-07-21 23:44:25

Some interesting points regarding Google BigTable are...

Bigtable vs. DBMS

  • Fast query rate
  • No joins and no SQL support; a column-oriented database
  • Uses one Bigtable instead of many normalized tables
  • Not even in 1NF when viewed traditionally
  • Designed to support historical queries via a timestamp field => what did this webpage look like yesterday?
  • Data compression is easier, since rows are sparse

I highlighted the lack of joins and SQL support because you mentioned you will need to run a series of reports. I don't know how much impact (if any) not being able to do this would have on running your reports if you were to use it.
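To make the timestamp point concrete, here is a plain-Python toy (not the real Bigtable API): each sparse row keeps multiple timestamped versions per column, which is what makes "what did this webpage look like yesterday?" a natural lookup.

    class ToyBigtableRow:
        """One sparse row: maps column -> list of (timestamp, value) versions."""

        def __init__(self):
            self.cells = {}

        def put(self, column, timestamp, value):
            self.cells.setdefault(column, []).append((timestamp, value))
            self.cells[column].sort()  # keep versions in timestamp order

        def get_at(self, column, timestamp):
            """Return the newest value at or before `timestamp` (a historical query)."""
            best = None
            for ts, value in self.cells.get(column, ()):
                if ts <= timestamp:
                    best = value
            return best

    page = ToyBigtableRow()
    page.put("contents:html", 100, "<html>old version</html>")
    page.put("contents:html", 200, "<html>new version</html>")
    print(page.get_at("contents:html", 150))  # -> the old version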

陌上芳菲 2024-07-21 23:44:25

Google's BigTable database and Hadoop are two database engines that can handle large amounts of data.

各自安好 2024-07-21 23:44:25

We use Firebird for a really huge database (keeping data for more than 30 years now) and it scales very well.

The best thing about it is that it has properties you can configure, but unlike e.g. Oracle, you can install it and it works very well without any configuration before you start using it.
