Database for analytics

I'm setting up a large database that will generate statistical reports from incoming data.
The system will for the most part operate as follows:

  1. Approximately 400k-500k rows - about 30 columns, mostly varchar(5-30) and datetime - will be uploaded each morning. It's approximately 60 MB in flat-file form, but grows steeply in the DB with the addition of suitable indexes.
  2. Various statistics will be generated from the current day's data.
  3. Reports from these statistics will be generated and stored.
  4. Current data set will get copied into a partitioned history table.
  5. Throughout the day, the current data set (which was copied, not moved) can be queried by end users for information that is unlikely to involve constants, but rather relationships between fields.
  6. Users may request specialized searches from the history table, but the queries will be crafted by a DBA.
  7. Before the next day's upload, the current data table is truncated.

This will essentially be version 2 of our existing system.

Right now, we're using MySQL 5.0 MyISAM tables (InnoDB was killing us on space usage alone) and suffering greatly on #6 and #4. #4 is currently not a partitioned table, as 5.0 doesn't support partitioning. To get around the tremendous amount of time (hours and hours) it's taking to insert records into history, we write each day to an unindexed history_queue table, and then on the weekends, during our slowest time, write the queue to the history table. The problem is that any historical queries generated during the week can be several days behind. We can't reduce the indexes on the historical table or queries against it become unusable.
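For context, the weekend flush amounts to roughly the following (history and history_queue are the tables named above; the exact statements are an assumption):

INSERT INTO history SELECT * FROM history_queue;
TRUNCATE TABLE history_queue;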

We're definitely moving to at least MySQL 5.1 (if we stay with MySQL) for the next release, but are strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering whether anybody had any advice relevant to this situation. Most of the research out there revolves around web-site usage. Indexing is really our main beef with MySQL, and it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
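For instance, partial and expression indexes would let us index only what the reports actually touch. A sketch (the history table follows the post; the column names are hypothetical):

-- Partial index: only index rows a given report filters on
CREATE INDEX idx_history_errors ON history (event_time) WHERE status = 'ERROR';

-- Expression (function-based) index, e.g. for case-insensitive lookups
CREATE INDEX idx_history_lower_code ON history (lower(code));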

I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3, or is it more balanced now?

Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle was.

NOTE on MyISAM vs InnoDB for us:
We were running InnoDB and, for us, we found it MUCH slower, like 3-4 times slower. BUT, we were also much newer to MySQL, and frankly I'm not sure we had the DB tuned appropriately for InnoDB.

We're running in an environment with a very high degree of uptime - battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.

In regards to 5.1:
I've heard the concerns about stability issues in 5.1. Generally I assume that any recently released (within the last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up given the chance to re-engineer the project.

In regards to PostgreSQL gotchas:
COUNT(*) without any WHERE clause is a pretty rare case for us; I don't anticipate this being an issue.
COPY FROM isn't nearly as flexible as LOAD DATA INFILE, but an intermediate loading table will fix that.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building a processing table so that we could avoid putting records in twice and then having to do a giant GROUP BY at the end just to remove the dups. I think it's used infrequently enough that the lack of it is tolerable.
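A minimal sketch of the staging-table idea from the two points above (table names and path are hypothetical): COPY into an unconstrained staging table, then de-dup on the way into the real one.

COPY load_stage FROM '/tmp/daily.csv' WITH CSV;
INSERT INTO current_day
SELECT DISTINCT * FROM load_stage;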

9 Answers

不羁少年 2024-07-23 19:21:46

I'd go for PostgreSQL. You need, for example, partitioned tables, which have been in stable Postgres releases since at least 2005 - in MySQL they are a novelty. I've heard about stability issues in the new features of 5.1. With MyISAM you have no referential integrity or transactions, and concurrent access suffers a lot - read the blog entry "Using MyISAM in production" for more.

And Postgres is much faster on complicated queries, which will be good for your #6.
There is also a very active and helpful mailing list, where you can get support for free, even from core Postgres developers. Postgres has some gotchas, though.
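(For reference, partitioning in Postgres 8.x is done with table inheritance plus CHECK constraints - a minimal sketch, with an assumed date column:)

-- Child partition for one month; with constraint_exclusion = on, the
-- planner skips partitions whose CHECK cannot match the query.
CREATE TABLE history_2008_07 (
    CHECK (event_date >= DATE '2008-07-01' AND event_date < DATE '2008-08-01')
) INHERITS (history);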

从﹋此江山别 2024-07-23 19:21:46

The Infobright people appear to be doing some interesting things along these lines:

http://www.infobright.org/

-- psj

情徒 2024-07-23 19:21:46

If Oracle is not considered an option because of cost issues, then Oracle Express Edition is available for free (as in beer). It has size limitations, but if you do not keep history around for too long anyway, it should not be a concern.

九命猫 2024-07-23 19:21:46

Check your hardware. Are you maxing out the IO? Do you have buffers configured properly? Is your hardware sized correctly? Memory for buffering and fast disks are key.

If you have too many indexes, it'll slow inserts down substantially.

How are you doing your inserts? If you're doing one record per INSERT statement:

INSERT INTO TABLE blah VALUES (?, ?, ?, ?)

and calling it 500K times, your performance will suck. I'm surprised it's finishing in hours. With MySQL you can insert hundreds or thousands of rows at a time:

INSERT INTO TABLE blah VALUES
  (?, ?, ?, ?),
  (?, ?, ?, ?),
  (?, ?, ?, ?)

If you're doing one insert per web request, you should consider logging to the file system and doing bulk imports on a crontab. I've used that design in the past to speed up inserts. It also means your web pages don't depend on the database server.

It's also much faster to use LOAD DATA INFILE to import a CSV file. See http://dev.mysql.com/doc/refman/5.1/en/load-data.html
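A typical invocation looks something like this (path, table name, and format clauses are illustrative):

LOAD DATA INFILE '/tmp/daily.csv'
INTO TABLE current_day
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;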

The other thing I can suggest is be wary of the SQL hammer -- you may not have SQL nails. Have you considered using a tool like Pig or Hive to generate optimized data sets for your reports?

EDIT

If you're having troubles batch importing 500K records, you need to compromise somewhere. I would drop some indexes on your master table, then create optimized views of the data for each report.
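For example, a per-report view might be as simple as the following (names are hypothetical; a pre-computed summary table would work just as well):

CREATE VIEW report_daily_totals AS
SELECT report_date, region, COUNT(*) AS row_count
FROM master_table
GROUP BY report_date, region;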

回忆躺在深渊里 2024-07-23 19:21:46

Have you tried playing with the MyISAM key buffer parameter? It is very important for index update speed.
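(The stock variable is called key_buffer_size; it sizes MyISAM's shared cache of index blocks. In my.cnf, with a purely illustrative value:)

[mysqld]
key_buffer_size = 512M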

Also, if you have indexes on correlated columns such as date, id, etc., you can do:

INSERT INTO archive SELECT .. FROM current ORDER BY id (or date)

The idea is to insert the rows in order; in that case the index update is much faster. Of course, this only works for the indexes that agree with the ORDER BY... If you have some rather random columns, those won't be helped.

but strongly considering PostgreSQL.

You should definitely test it.

it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.

Yep.

I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?

Well, that depends. As with any database,

  • IF YOU DON'T KNOW HOW TO CONFIGURE AND TUNE IT, IT WILL BE SLOW
  • If your hardware is not up to the task, it will be slow

Some people who know mysql well and want to try postgres don't factor in the fact that they need to re-learn some things and read the docs; as a result, a really badly configured postgres gets benchmarked, and that can be pretty slow.

For web usage, I've benchmarked a well-configured postgres on a low-end server (Core 2 Duo, SATA disk) with a custom forum benchmark that I wrote, and it spat out more than 4000 forum web pages per second, saturating the database server's gigabit ethernet link. So if you know how to use it, it can be screaming fast (InnoDB was much slower due to concurrency issues). "MyISAM is faster for small simple selects" is total bull; postgres will zap a "small simple select" in 50-100 microseconds.

Now, for your usage, you don't care about that ;)

You care about the ways your database can compute Big Aggregates and Big Joins, and a properly configured postgres with a good IO system will usually win against a MySQL system on those, because the optimizer is much smarter and has many more join/aggregate types to choose from.

My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid putting multiple records in twice and then having to do a giant GROUP BY at the end just to remove some dups. I think its used just infrequently enough for the lack of it to be tolerable.

You can use a GROUP BY, but if you want to insert into a table only records that are not already there, you can do this :

INSERT INTO target SELECT .. FROM source LEFT JOIN target ON (...) WHERE target.id IS NULL

In your use case you have no concurrency problems, so that works well.

软糯酥胸 2024-07-23 19:21:45

My work tried a pilot project to migrate historical data from an ERP setup. The size of the data is on the small side, only 60 GB, covering ~21 million rows, with the largest table having 16 million rows. There are an additional ~15 million rows waiting to come into the pipe, but the pilot has been shelved due to other priorities. The plan was to use PostgreSQL's "job" facility to schedule queries that would regenerate, on a daily basis, data suitable for use in analytics.

Running simple aggregates over the large 16-million-record table, the first thing I noticed is how sensitive it is to the amount of RAM available. At one point, an increase in RAM allowed a year's worth of aggregates without resorting to sequential table scans.

If you decide to use PostgreSQL, I would highly recommend re-tuning the config file, as it tends to ship with the most conservative settings possible (so that it will run on systems with little RAM). Tuning takes a little while, maybe a few hours, but once you get it to a point where response is acceptable, just set it and forget it.
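(The usual first knobs in postgresql.conf, with purely illustrative values - the right numbers depend entirely on the machine's RAM:)

shared_buffers = 1GB            # default ships very small
work_mem = 64MB                 # memory per sort/hash operation
maintenance_work_mem = 256MB    # index builds, VACUUM
effective_cache_size = 4GB      # planner hint about the OS file cache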

Once you have the server-side tuning done (and it's all about memory, surprise!) you'll turn your attention to your indexes. Indexing and query planning also require a little effort, but once set you'll find them to be effective. Partial indexes are a nice feature for isolating records that have "edge-case" data in them; I highly recommend this feature if you are looking for exceptions in a sea of similar data.

Lastly, use the tablespace feature to relocate the data onto a fast drive array.
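(A sketch, with a hypothetical mount point:)

CREATE TABLESPACE fast_array LOCATION '/mnt/fast_raid/pgdata';
ALTER TABLE history SET TABLESPACE fast_array;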

梦里°也失望 2024-07-23 19:21:45

In my practical experience I have to say that postgresql had quite a performance jump from 7.x/8.0 to 8.1 (for our use cases, in some instances 2x-3x faster); from 8.1 to 8.2 the improvement was smaller but still noticeable. I don't know the improvements between 8.2 and 8.3, but I expect some performance gain there too; I haven't tested it so far.

Regarding indices, I would recommend dropping them and only creating them again after filling the database with your data; it is much faster.
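(Roughly like this; index, table, and file names are hypothetical:)

DROP INDEX idx_current_code;
COPY current_day FROM '/tmp/daily.csv' WITH CSV;
CREATE INDEX idx_current_code ON current_day (code);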

Also, tune the crap out of your postgresql settings; there is a lot of gain to be had there. The default settings are at least sensible now; in pre-8.2 times pg was optimized for running on a PDA.

In some cases, especially if you have complicated queries, it can help to deactivate nested loops in your settings, which forces pg to use better-performing approaches for your queries.
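(That's a one-line, per-session setting:)

SET enable_nestloop = off;  -- re-enable afterwards with: SET enable_nestloop = on;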

Ah, yes, did I say that you should go for postgresql?

(An alternative would be Firebird, which is not as flexible, but in my experience it performs much better than mysql and postgresql in some cases.)

陌生 2024-07-23 19:21:45

In my experience InnoDB is slightly faster for really simple queries, pg for more complex ones. MyISAM is probably even faster than InnoDB for retrieval, but perhaps slower for indexing/index repair.

These mostly-varchar fields - are you indexing them with char(n) indexes?

Can you normalize some of them? It'll cost you on the rewrite, but may save time on subsequent queries, as your row size will decrease, thus fitting more rows into memory at one time.

ON EDIT:

OK, so you have two problems: query time against the daily table, and updating the history, yes?

As to the second: in my experience, MySQL MyISAM is bad at re-indexing. On tables the size of your daily table (0.5 to 1M records, with rather wide (denormalized flat-input) records), I found it was faster to rewrite the table than to insert and wait for the re-indexing and attendant disk thrashing.

So this might or might not help:

CREATE TABLE new_table SELECT * FROM old_table;

This copies the table but not the indexes.

Then insert the new records as normal. Then create the indexes on the new table and wait a while. Drop the old table, and rename the new table to the old table.
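Put together, the whole dance is roughly the following (table, batch, index, and column names are hypothetical):

CREATE TABLE new_table SELECT * FROM old_table;    -- data only, no indexes
INSERT INTO new_table SELECT * FROM daily_batch;   -- the day's new records
CREATE INDEX idx_foo ON new_table (foo);           -- build indexes once, at the end
DROP TABLE old_table;
RENAME TABLE new_table TO old_table;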

Edit: In response to the fourth comment: I don't know that MyISAM is always that bad. I know in my particular case, I was shocked at how much faster copying the table and then adding the index was. As it happened, I was doing something similar to what you are doing, copying large denormalized flat files into the database and then renormalizing the data. But that's an anecdote, not data. ;)

(I also think I found that overall InnoDB was faster, given that I was doing as much inserting as querying - a very special case of database use.)

Note that copying with a SELECT a.*, b.value AS foo ... JOIN was also faster than an UPDATE a.foo = b.value ... JOIN, which follows, as the update was against an indexed column.

我们只是彼此的过ke 2024-07-23 19:21:45

What is not clear to me is how complex the analytical processing is. In my opinion, 500K records to process should not be such a big problem; in terms of analytical processing, it is a small record set.

Even if it is a complex job, if you can leave it overnight to complete (since it is a daily process, as I understood from your post), it should still be enough.

Regarding the resulting table, I would not reduce its indexes. Again, you can do the loading overnight, including the index refresh, and have the resulting, updated data set ready for use in the morning, with quicker access than raw (non-indexed) tables.

I saw PostgreSQL used in a data-warehouse-like environment, working with the setup I've described (data transformation jobs overnight), and with no performance complaints.
