最优数据库表优化方法

发布于 2024-08-23 10:54:33 字数 487 浏览 2 评论 0原文

我有一个数据库表变得太大(几亿行)需要优化,但在对其进行分区之前,我想我应该询问建议。

这是用法:

0。表包含约 10 列,每列长度约 20 字节。

  1. INSERTS 以每秒数百次的速率执行。

  2. 每小时根据列“a”(其中 a='xxxx')执行几次 SELECT 语句。

  3. DELETE 语句是基于 DATE 列执行的。 (删除超过 1 年的日期)通常每天一次。

关键要求是加快 INSERT 和 SELECT 语句的速度,并且能够保留 1 年前的历史数据,而无需在删除时锁定整个表。

我猜想我必须有两个索引,一个用于列“a”,另一个用于日期字段。或者是否可以同时优化两者?

选择速度和删除速度之间是否需要进行权衡?

分区是唯一的解决方案吗?对此类表进行分区的好策略是什么?

我使用的是 PostgreSQL 8.4 数据库。

I have a database table that is growing too big (few hundred million rows) that needs to be optimized, but before I get into partitioning it, I thought I'd ask about suggestions.

Here is the usage:

0 . Table contains about 10 columns of length about 20 bytes each.

  1. INSERTS are performed at a rate of hundreds of times per second.

  2. SELECT statements are performed based on column 'a' (where a='xxxx' ) a few times per hour.

  3. DELETE statements are performed based on a DATE column. (delete where date older than 1 year) usually once per day.

The key requirement is to speed up INSERT and SELECT statements, and be able to keep history data of 1 year back without locking the whole table down while deleting.

I would guess that I must have two indexes, one for column 'a', and the other for the date field. or is it possible to optimize both?

Will there be a necessary trade-off between speed on select and speed of delete?

Is partitioning the only solution? What are good strategies for partitioning such table?

I'm using a PostgreSQL 8.4 database.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

小ぇ时光︴ 2024-08-30 10:54:33

您是否没有研究过 PostgreSQL 分区<,而不是将其保留为单个物理表/a>?从 8.1 版本开始受支持。

分区可以帮助您避免在快速插入和快速删除性能之间进行选择的问题。您始终可以按年/月对表进行分区,然后删除不再需要的分区。删除分区非常快,插入小分区也非常快。

从手册中:

<块引用>
<块引用>

分区是指将逻辑上的一个大表分割成
较小的物理碎片。分区
可以提供多种好处:


  • 对于某些特定情况,查询性能可以显着提高
    各种查询。
  • 更新性能也可以得到提高,因为每个部分
    表的索引小于
    整个数据集的索引将是。
    当索引不再容易适应时
    内存,读和写操作
    在指数上逐渐增加
    磁盘访问。
  • 只需删除其中一项即可完成批量删除
    分区,如果该要求是
    规划到分区设计中。
    DROP TABLE 比批量操作快得多
    删除,更不用说接下来的事情了
    真空吸尘。
  • 很少使用的数据可以迁移到更便宜且速度较慢的存储中
    媒体。

<块引用>
<块引用>

通常只有当桌子可以使用时,这些好处才值得
否则会非常大。确切的
桌子将受益的点
分区取决于
应用程序,尽管是经验法则
就是桌子的大小应该
超过物理内存
数据库服务器。

目前,PostgreSQL 支持通过表继承进行分区。
每个分区必须创建为
单个父表的子表。
父表本身通常是
空的;它的存在只是为了代表
整个数据集。你应该是
熟悉继承(参见部分
5.8)在尝试实现分区之前。


Rather than keeping it a single physical table, have you looked into PostgreSQL partitioning? It's supported as of version 8.1.

Partitioning can help you avoid the problem of choosing between fast INSERT vs fast DELETE performance. You can always partition the table by Year/Month, and just drop the partitions that you no longer need. Dropping partitions is extremely fast, and inserting into small partitions is also extremely fast.

From the manual:

Partitioning refers to splitting what is logically one large table into
smaller physical pieces. Partitioning
can provide several benefits:

  • Query performance can be improved dramatically for certain
    kinds of queries.
  • Update performance can be improved too, since each piece of the
    table has indexes smaller than an
    index on the entire data set would be.
    When an index no longer fits easily in
    memory, both read and write operations
    on the index take progressively more
    disk accesses.
  • Bulk deletes may be accomplished by simply removing one of
    the partitions, if that requirement is
    planned into the partitioning design.
    DROP TABLE is far faster than a bulk
    DELETE, to say nothing of the ensuing
    VACUUM overhead.
  • Seldom-used data can be migrated to cheaper and slower storage
    media.

The benefits will normally be worthwhile only when a table would
otherwise be very large. The exact
point at which a table will benefit
from partitioning depends on the
application, although a rule of thumb
is that the size of the table should
exceed the physical memory of the
database server.

Currently, PostgreSQL supports partitioning via table inheritance.
Each partition must be created as a
child table of a single parent table.
The parent table itself is normally
empty; it exists just to represent the
entire data set. You should be
familiar with inheritance (see Section
5.8) before attempting to implement partitioning.

枕梦 2024-08-30 10:54:33

正如其他人所说,分区是您的答案,但是:

我会根据一些 hash(a) 进行分区。如果 a 是一个整数,那么 a%256 就可以了。如果它是文本,则类似于 substring(md5(a) for 2)

它将加快插入和选择的速度。

对于删除,我会让它们运行得更频繁,但更小,并且也进行分区。我每小时运行一次(在 XX:30),如下所示:

delete from table_name
where date<(current_date - interval '1 year')
and
  hash(a)
  =
  (extract(doy from current_timestamp) * 24
    + extract(hour from current_timestamp))::int % 256;

编辑:我刚刚测试了这个:

create function hash(a text) returns text as $ select substring(md5($1) for 1) $ language sql immutable strict;
CREATE TABLE tablename (id text, mdate date);
CREATE TABLE tablename_partition_0 ( CHECK ( hash(id) = '0' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_1 ( CHECK ( hash(id) = '1' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_2 ( CHECK ( hash(id) = '2' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_3 ( CHECK ( hash(id) = '3' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_4 ( CHECK ( hash(id) = '4' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_5 ( CHECK ( hash(id) = '5' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_6 ( CHECK ( hash(id) = '6' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_7 ( CHECK ( hash(id) = '7' ) ) INHERITS (tablename); 
CREATE TABLE tablename_partition_8 ( CHECK ( hash(id) = '8' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_9 ( CHECK ( hash(id) = '9' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_a ( CHECK ( hash(id) = 'a' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_b ( CHECK ( hash(id) = 'b' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_c ( CHECK ( hash(id) = 'c' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_d ( CHECK ( hash(id) = 'd' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_e ( CHECK ( hash(id) = 'e' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_f ( CHECK ( hash(id) = 'f' ) ) INHERITS (tablename);
analyze;
explain select * from tablename where id='bar' and hash(id)=hash('bar');
 查询计划                                          
-------------------------------------------------- -------------------------------------------
 结果(成本=0.00..69.20行=2宽度=36)
   ->附加(成本=0.00..69.20行=2宽度=36)
         ->对表名进行顺序扫描(成本=0.00..34.60 行=1 宽度=36)
               过滤器: ((id = 'bar'::text) AND ("子字符串"(md5(id), 1, 1) = '3'::text))
         ->对 tablename_partition_3 表名进行顺序扫描(成本=0.00..34.60 行=1 宽度=36)
               过滤器: ((id = 'bar'::text) AND ("子字符串"(md5(id), 1, 1) = '3'::text))
(6 行)

您需要将 hash(id)=hash('searched_value') 添加到查询中,否则 Postgres 将搜索所有表。


编辑:您还可以使用规则系统自动插入以更正表:

create rule tablename_rule_0 as
  on insert to tablename where hash(NEW.id)='0'
  do instead insert into tablename_partition_0 values (NEW.*);
create rule tablename_rule_1 as
  on insert to tablename where hash(NEW.id)='1'
  do instead insert into tablename_partition_1 values (NEW.*);
-- and so on
insert into tablename (id) values ('a');
select * from tablename_partition_0;
 id | mdate 
----+-------
 a  | 
(1 row)

Partitioning is your answer, as others stated, but:

I'd partition on some hash(a). If a is an integer then a%256 would be good. If it is a text then something like substring(md5(a) for 2).

It will speed up inserts and selects.

For deletes I'd make them run more often but smaller and also partitioned. I'd run them every hour (at XX:30) and like this:

delete from table_name
where date<(current_date - interval '1 year')
and
  hash(a)
  =
  (extract(doy from current_timestamp) * 24
    + extract(hour from current_timestamp))::int % 256;

EDIT: I've just tested this:

create function hash(a text) returns text as $ select substring(md5($1) for 1) $ language sql immutable strict;
CREATE TABLE tablename (id text, mdate date);
CREATE TABLE tablename_partition_0 ( CHECK ( hash(id) = '0' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_1 ( CHECK ( hash(id) = '1' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_2 ( CHECK ( hash(id) = '2' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_3 ( CHECK ( hash(id) = '3' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_4 ( CHECK ( hash(id) = '4' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_5 ( CHECK ( hash(id) = '5' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_6 ( CHECK ( hash(id) = '6' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_7 ( CHECK ( hash(id) = '7' ) ) INHERITS (tablename); 
CREATE TABLE tablename_partition_8 ( CHECK ( hash(id) = '8' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_9 ( CHECK ( hash(id) = '9' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_a ( CHECK ( hash(id) = 'a' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_b ( CHECK ( hash(id) = 'b' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_c ( CHECK ( hash(id) = 'c' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_d ( CHECK ( hash(id) = 'd' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_e ( CHECK ( hash(id) = 'e' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_f ( CHECK ( hash(id) = 'f' ) ) INHERITS (tablename);
analyze;
explain select * from tablename where id='bar' and hash(id)=hash('bar');
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Result  (cost=0.00..69.20 rows=2 width=36)
   ->  Append  (cost=0.00..69.20 rows=2 width=36)
         ->  Seq Scan on tablename  (cost=0.00..34.60 rows=1 width=36)
               Filter: ((id = 'bar'::text) AND ("substring"(md5(id), 1, 1) = '3'::text))
         ->  Seq Scan on tablename_partition_3 tablename  (cost=0.00..34.60 rows=1 width=36)
               Filter: ((id = 'bar'::text) AND ("substring"(md5(id), 1, 1) = '3'::text))
(6 rows)

You'd need to add hash(id)=hash('searched_value') to your queries or Postgres will search all tables.


EDIT: You can also use rule system for automatic insertions to correct tables:

create rule tablename_rule_0 as
  on insert to tablename where hash(NEW.id)='0'
  do instead insert into tablename_partition_0 values (NEW.*);
create rule tablename_rule_1 as
  on insert to tablename where hash(NEW.id)='1'
  do instead insert into tablename_partition_1 values (NEW.*);
-- and so on
insert into tablename (id) values ('a');
select * from tablename_partition_0;
 id | mdate 
----+-------
 a  | 
(1 row)
胡大本事 2024-08-30 10:54:33

一种解决方案是根据插入日期进行分区。

也就是说,您的应用程序(或 DAO)根据某种逻辑来决定插入哪个表,该逻辑结合了当前日期(或者更确切地说,自上次分区切片启动以来的时间)和/或“最后”分区的大小。或者将此类逻辑卸载到日常脚本中,并让脚本填充一些“这是要使用的分区”以供 DAO 使用。

这立即消除了您删除“旧”行的需要(只需删除旧分区);
它还确保您的插入定期开始填充小表,这除其他外,加快了“平均”INSERT/SELECT 速度(当然,最坏的情况仍然很慢)

One solution is to partition based on insert date.

Namely, your application (or DAO) decides which table to insert into based on some logic which combines current date (or rather time since last partition slice was started) and/or te size of "last" partition. Or offload such logic into a daily script, and have the script populate some "this is the partition to use" for DAO's use.

This immediately removes your need to delete "old" rows (just drop the old partition);
it also ensures that your inserts periodically start populating small table, which, among other things, speeds up the "average" INSERT/SELECT speed (worst case scenarios are still just as slow of course)

微暖i 2024-08-30 10:54:33

如果您要将此表分解为适当的分区,您将能够使用截断而不是删除,这将降低您的维护成本,因为它不会产生死空间。

If you were to break this table up into proper partitioning, you would be able to use truncate instead of delete which would decrease your maintenance costs as it does not create dead space.

丘比特射中我 2024-08-30 10:54:33

我不是专家,但似乎对“a”列进行分区会加快您的选择速度,但对日期进行分区(正如所有其他答案所建议的那样)会加快删除速度(删除表),但对您来说毫无用处选择。

看来,这两种情况都会提高插入性能。

有专家关心这个问题吗?
对两个字段进行分区是否可能/有用?

I am no expert but it seems that partitioning on the column "a" would speed up your selects but partitioning on the date (as all the other answers are suggesting) would speed up the deleting (drop the table) but would be useless for your select.

It seems, both cases would enhance the insert performance.

Any expert care to weight in on the issue ?
Is it possible / useful to partition on both fields ?

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文