最优数据库表优化方法
我有一个数据库表变得太大(几亿行)需要优化,但在对其进行分区之前,我想我应该询问建议。
这是用法:
0。表包含约 10 列,每列长度约 20 字节。
INSERTS 以每秒数百次的速率执行。
每小时根据列“a”(其中 a='xxxx')执行几次 SELECT 语句。
DELETE 语句是基于 DATE 列执行的。 (删除超过 1 年的日期)通常每天一次。
关键要求是加快 INSERT 和 SELECT 语句的速度,并且能够保留 1 年前的历史数据,而无需在删除时锁定整个表。
我猜想我必须有两个索引,一个用于列“a”,另一个用于日期字段。或者是否可以同时优化两者?
选择速度和删除速度之间是否需要进行权衡?
分区是唯一的解决方案吗?对此类表进行分区的好策略是什么?
我使用的是 PostgreSQL 8.4 数据库。
I have a database table that is growing too big (few hundred million rows) that needs to be optimized, but before I get into partitioning it, I thought I'd ask about suggestions.
Here is the usage:
0 . Table contains about 10 columns of length about 20 bytes each.
INSERTS are performed at a rate of hundreds of times per second.
SELECT statements are performed based on column 'a' (where a='xxxx' ) a few times per hour.
DELETE statements are performed based on a DATE column. (delete where date older than 1 year) usually once per day.
The key requirement is to speed up INSERT and SELECT statements, and be able to keep history data of 1 year back without locking the whole table down while deleting.
I would guess that I must have two indexes, one for column 'a', and the other for the date field. or is it possible to optimize both?
Will there be a necessary trade-off between speed on select and speed of delete?
Is partitioning the only solution? What are good strategies for partitioning such table?
I'm using a PostgreSQL 8.4 database.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您是否没有研究过 PostgreSQL 分区<,而不是将其保留为单个物理表/a>?从 8.1 版本开始受支持。
分区可以帮助您避免在快速插入和快速删除性能之间进行选择的问题。您始终可以按年/月对表进行分区,然后删除不再需要的分区。删除分区非常快,插入小分区也非常快。
从手册中:
Rather than keeping it a single physical table, have you looked into PostgreSQL partitioning? It's supported as of version 8.1.
Partitioning can help you avoid the problem of choosing between fast INSERT vs fast DELETE performance. You can always partition the table by Year/Month, and just drop the partitions that you no longer need. Dropping partitions is extremely fast, and inserting into small partitions is also extremely fast.
From the manual:
正如其他人所说,分区是您的答案,但是:
我会根据一些
hash(a)
进行分区。如果a
是一个整数,那么a%256
就可以了。如果它是文本,则类似于substring(md5(a) for 2)
。它将加快插入和选择的速度。
对于删除,我会让它们运行得更频繁,但更小,并且也进行分区。我每小时运行一次(在 XX:30),如下所示:
编辑:我刚刚测试了这个:
您需要将
hash(id)=hash('searched_value')
添加到查询中,否则 Postgres 将搜索所有表。编辑:您还可以使用规则系统自动插入以更正表:
Partitioning is your answer, as others stated, but:
I'd partition on some
hash(a)
. Ifa
is an integer thena%256
would be good. If it is a text then something likesubstring(md5(a) for 2)
.It will speed up inserts and selects.
For deletes I'd make them run more often but smaller and also partitioned. I'd run them every hour (at XX:30) and like this:
EDIT: I've just tested this:
You'd need to add
hash(id)=hash('searched_value')
to your queries or Postgres will search all tables.EDIT: You can also use rule system for automatic insertions to correct tables:
一种解决方案是根据插入日期进行分区。
也就是说,您的应用程序(或 DAO)根据某种逻辑来决定插入哪个表,该逻辑结合了当前日期(或者更确切地说,自上次分区切片启动以来的时间)和/或“最后”分区的大小。或者将此类逻辑卸载到日常脚本中,并让脚本填充一些“这是要使用的分区”以供 DAO 使用。
这立即消除了您删除“旧”行的需要(只需删除旧分区);
它还确保您的插入定期开始填充小表,这除其他外,加快了“平均”INSERT/SELECT 速度(当然,最坏的情况仍然很慢)
One solution is to partition based on insert date.
Namely, your application (or DAO) decides which table to insert into based on some logic which combines current date (or rather time since last partition slice was started) and/or te size of "last" partition. Or offload such logic into a daily script, and have the script populate some "this is the partition to use" for DAO's use.
This immediately removes your need to delete "old" rows (just drop the old partition);
it also ensures that your inserts periodically start populating small table, which, among other things, speeds up the "average" INSERT/SELECT speed (worst case scenarios are still just as slow of course)
如果您要将此表分解为适当的分区,您将能够使用截断而不是删除,这将降低您的维护成本,因为它不会产生死空间。
If you were to break this table up into proper partitioning, you would be able to use truncate instead of delete which would decrease your maintenance costs as it does not create dead space.
我不是专家,但似乎对“a”列进行分区会加快您的选择速度,但对日期进行分区(正如所有其他答案所建议的那样)会加快删除速度(删除表),但对您来说毫无用处选择。
看来,这两种情况都会提高插入性能。
有专家关心这个问题吗?
对两个字段进行分区是否可能/有用?
I am no expert but it seems that partitioning on the column "a" would speed up your selects but partitioning on the date (as all the other answers are suggesting) would speed up the deleting (drop the table) but would be useless for your select.
It seems, both cases would enhance the insert performance.
Any expert care to weight in on the issue ?
Is it possible / useful to partition on both fields ?