Slow simple UPDATE query on a PostgreSQL database with 3 million rows

I am trying a simple UPDATE table SET column1 = 0 on a table with about 3 million rows on Postgres 8.4, but it is taking forever to finish. It has been running for more than 10 minutes.

Before that, I tried running VACUUM and ANALYZE on the table, and I also tried creating some indexes (although I doubt indexes will make any difference in this case), but nothing seems to help.

Any other ideas?

Update:

This is the table structure:

CREATE TABLE myTable
(
  id bigserial NOT NULL,
  title text,
  description text,
  link text,
  "type" character varying(255),
  generalFreq real,
  generalWeight real,
  author_id bigint,
  status_id bigint,
  CONSTRAINT resources_pkey PRIMARY KEY (id),
  CONSTRAINT author_pkey FOREIGN KEY (author_id)
      REFERENCES users (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT c_unique_status_id UNIQUE (status_id)
);

I am trying to run UPDATE myTable SET generalFreq = 0;


10 Answers

习ぎ惯性依靠 2024-09-18 18:18:29

The first thing I'd suggest (from https://dba.stackexchange.com/questions/118178/does-updating-a-row-with-the-same-value-actually-update-the-row) is to only update rows that "need" it, e.g.:

 UPDATE myTable SET generalFreq = 0 where generalFreq != 0;

(You might also need an index on generalFreq.) That way you update fewer rows. It won't help if every value is already non-zero, but when it does apply it saves work, because otherwise the UPDATE rewrites each row and all of its index entries regardless of whether the value actually changed.

Another option: if the stars align in terms of defaults and not-null constraints, you can drop the old column and create a new one instead, which only adjusts metadata and is essentially instant; a sketch of that idea follows below.
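
A minimal sketch of that column-swap idea, assuming nothing else depends on the old column values, and noting that on older versions such as 8.4 an ADD COLUMN with a non-null DEFAULT still rewrites the whole table, so the metadata-only shortcut really applies when the new column can start out NULL or on PostgreSQL 11+:

-- Sketch: replace the column instead of updating it in place.
BEGIN;
ALTER TABLE myTable DROP COLUMN generalFreq;
-- On PostgreSQL 11+ a constant DEFAULT here is metadata-only; on 8.4 it would
-- rewrite the table, so there you may prefer adding the column without a DEFAULT
-- and treating NULL as 0 in queries.
ALTER TABLE myTable ADD COLUMN generalFreq real DEFAULT 0;
COMMIT;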

机场等船 2024-09-18 18:18:29

In my tests I noticed that one big update of more than 200,000 rows is slower than two updates of 100,000 rows each, even with a temporary table.

My solution is to loop: in each iteration I create a temporary table of 200,000 rows, compute my values in that table, then update my main table with the new values, and so on. A sketch of one batch is shown below.

Every 2,000,000 rows I manually run VACUUM ANALYZE mytable; I noticed that autovacuum doesn't keep up with this kind of update.
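
A minimal sketch of one such batch, assuming rows are chunked by id; new_generalFreq is a hypothetical name standing in for whatever value the batch actually computes:

-- One iteration of the loop; the id range defines the 200,000-row chunk.
BEGIN;
CREATE TEMP TABLE batch AS
    SELECT id, 0::real AS new_generalFreq   -- compute the new values here
    FROM myTable
    WHERE id BETWEEN 1 AND 200000;
UPDATE myTable m
SET generalFreq = b.new_generalFreq
FROM batch b
WHERE m.id = b.id;
DROP TABLE batch;
COMMIT;
-- Roughly every 2,000,000 updated rows:
VACUUM ANALYZE myTable;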

美人骨 2024-09-18 18:18:29

I need to update more than 1B rows in a PostgreSQL table that has some indexes. I am working with PostgreSQL 12 + SQLAlchemy + Python.

Inspired by the answers here, I wrote an updater based on a temp table and UPDATE ... FROM to see if it makes a difference. The temp table is fed from an in-memory CSV generated by Python and uploaded over the normal SQL client connection.

The speed-up over the naive approach using SQLAlchemy's bulk_update_mappings is 4x - 5x. Not an order of magnitude, but still considerable, and in my case it means the batch job takes 1 day instead of 1 week.

Below is the relevant Python code that does CREATE TEMPORARY TABLE, COPY FROM and UPDATE ... FROM. See the full example in this gist.

import csv
import logging
from io import StringIO
from typing import List

from sqlalchemy.orm import Session

logger = logging.getLogger(__name__)


def bulk_load_psql_using_temp_table(
        dbsession: Session,
        data_as_dicts: List[dict],
):
    """Bulk update columns in PostgreSQL faster using temp table.

    Works around speed issues on `bulk_update_mapping()` and PostgreSQL.
    Your mileage and speed may vary, but it is going to be faster.
    The observation was 3x ... 4x faster when doing UPDATEs
    where one of the columns is indexed.

    Contains hardcoded temp table creation and UPDATE FROM statements.
    In our case we are bulk updating three columns.

    - Create a temp table, if not created before

    - Fill it from the in-memory CSV using COPY FROM

    - Then perform UPDATE ... FROM on the actual table from the temp table

    - Between the update chunks, clear the temp table using TRUNCATE

    Why is it faster? I did not get a clear answer from the sources I was reading.
    At least there should be
    less data uploaded from the client to the server,
    as CSV loading is more compact than bulk updates.

    Further reading

    - `About PSQL temp tables <https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/>`_

    - `Naive bulk_update_mapping approach <https://stackoverflow.com/questions/36272316/using-bulk-update-mappings-in-sqlalchemy-to-update-multiple-rows-with-different>`_

    - `Discussion on UPDATE ... FROM + temp table approach <https://stackoverflow.com/questions/3361291/slow-simple-update-query-on-postgresql-database-with-3-million-rows/24811058#24811058>`_.

    :param dbsession:
        SQLAlchemy session.
        Note that we open a separate connection for the bulk update.

    :param data_as_dicts:
        Inbound data, as it would be given to bulk_update_mapping()
    """

    # mem table created in sql
    temp_table_name = "temp_bulk_temp_loader"

    # the real table whose data we are updating
    real_table_name = "swap"

    # columns we need to copy
    columns = ["id", "sync_event_id", "sync_reserve0", "sync_reserve1"]

    # how our CSV fields are separated
    delim = ";"

    # Increase temp buffer size for updates
    temp_buffer_size = "3000MB"

    # Dump data to a local mem buffer using CSV writer.
    # No header - this is specifically addressed in copy_from()
    out = StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter=delim)
    writer.writerows(data_as_dicts)

    # Update data in alternative raw connection
    engine = dbsession.bind
    conn = engine.connect()

    try:
        # No rollbacks
        conn.execution_options(isolation_level="AUTOCOMMIT")

        # See https://blog.codacy.com/how-to-update-large-tables-in-postgresql/
        conn.execute(f"""SET temp_buffers = "{temp_buffer_size}";""")

        # Temp table is dropped at the end of the session
        # https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/
        # This must match data_as_dicts structure.
        sql = f"""
        CREATE TEMP TABLE IF NOT EXISTS {temp_table_name}
        (
          id int,
          sync_event_id int,
          sync_reserve0 bytea,
          sync_reserve1 bytea
        );    
        """
        conn.execute(sql)

        # Clean any pending data in the temp table
        # between update chunks.
        # TODO: Not sure why this does not clear itself at conn.close()
        # as I would expect based on the documentation.
        sql = f"TRUNCATE {temp_table_name}"
        conn.execute(sql)

        # Load data from CSV to the temp table
        # https://www.psycopg.org/docs/cursor.html
        cursor = conn.connection.cursor()
        out.seek(0)
        cursor.copy_from(out, temp_table_name, sep=delim, columns=columns)

        # Fill real table from the temp table
        # This copies values from the temp table using
        # UPDATE...FROM and matching by the row id.
        sql = f"""
        UPDATE {real_table_name}  
        SET 
            sync_event_id=b.sync_event_id,
            sync_reserve0=b.sync_reserve0,
            sync_reserve1=b.sync_reserve1        
        FROM {temp_table_name} AS b 
        WHERE {real_table_name}.id=b.id;
        """
        res = conn.execute(sql)
        logger.debug("Updated %d rows", res.rowcount)
    finally:
        conn.close()

懒猫 2024-09-18 18:18:29

Try:

UPDATE myTable SET generalFreq = 0.0;

Maybe it is a casting issue.

一生独一 2024-09-18 18:18:28

I have to update tables of 1-2 billion rows, with a different value for each row. Each run makes ~100 million changes (10%).
My first try was to group them into transactions of 300K updates directly on a specific partition, since PostgreSQL does not always optimize prepared queries when you use partitions.

  1. Transactions of a bunch of "UPDATE myTable SET myField=value WHERE myId=id" statements:
     gives 1,500 updates/sec, which means each run would take at least 18 hours.
  2. HOT-updates solution as described here, with FILLFACTOR=50:
     gives 1,600 updates/sec. I use SSDs, so it's a costly improvement because it doubles the storage size.
  3. Insert the updated values into a temporary table and merge them afterwards with UPDATE ... FROM:
     gives 18,000 updates/sec if I do a VACUUM for each partition, and 100,000 updates/sec otherwise. Cooool.
     Here is the sequence of operations:

CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, field(s) to be updated,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));

Accumulate a bunch of updates in a buffer, depending on available RAM.
When it is full, or when you need to change table/partition, or when you are done:

COPY tempTable FROM buffer;
UPDATE myTable a SET field(s)=value(s) FROM tempTable b WHERE a.id=b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM FULL ANALYZE myTable;

That means a run now takes 1.5 hours instead of 18 hours for 100 million updates, vacuum included. To save time, it is not necessary to run a VACUUM FULL at the end, but even a fast regular VACUUM is useful to keep transaction IDs under control on the database and to avoid an unwanted autovacuum during rush hours.
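
A concrete, minimal sketch of that sequence, where generalFreq stands in for the author's "field(s) to be updated" and COPY ... FROM STDIN stands in for whatever buffer the client streams from:

CREATE TEMP TABLE tempTable (
  id BIGINT NOT NULL,
  generalFreq REAL,
  CONSTRAINT tempTable_pkey PRIMARY KEY (id)
);

-- Stream the accumulated (id, new value) pairs from the client-side buffer.
COPY tempTable (id, generalFreq) FROM STDIN;

BEGIN;
UPDATE myTable a SET generalFreq = b.generalFreq FROM tempTable b WHERE a.id = b.id;
COMMIT;

TRUNCATE TABLE tempTable;
VACUUM ANALYZE myTable;  -- a plain VACUUM between batches is usually enough; FULL only at the very end, if at all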

数理化全能战士 2024-09-18 18:18:28

Take a look at this answer: PostgreSQL slow on a large table with arrays and lots of updates

First start with a better FILLFACTOR, do a VACUUM FULL to force a table rewrite, and check the HOT updates after your UPDATE query:

SELECT n_tup_hot_upd, * FROM pg_stat_user_tables WHERE relname = 'myTable';

HOT updates are much faster when you have a lot of records to update. More information about HOT can be found in this article.

PS: you need version 8.3 or newer.
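
A minimal sketch of setting a lower FILLFACTOR and forcing the rewrite; 50 is just an example value, chosen to leave free space on each page for HOT updates:

ALTER TABLE myTable SET (fillfactor = 50);  -- leave room on each page for HOT updates
VACUUM FULL myTable;                        -- rewrite the table so the setting takes effect
-- After the UPDATE, compare HOT vs. total updates (unquoted names are folded to lower case):
SELECT n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 'mytable';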

幽梦紫曦~ 2024-09-18 18:18:28

After waiting 35 minutes for my UPDATE query to finish (it still hadn't), I decided to try something different. So what I ran was a command along these lines:

CREATE TABLE table2 AS 
SELECT 
  all the fields of table1 except the one I wanted to update, 0 as theFieldToUpdate
from myTable

Then I added the indexes, dropped the old table, and renamed the new one to take its place. That took only 1.7 minutes to process, plus some extra time to recreate the indexes and constraints. But it did help! :)

Of course that only worked because nobody else was using the database. If this were a production environment, I would need to lock the table first.
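
For the table in the question, that swap might look like the sketch below (assuming exclusive access, as noted above; CREATE TABLE AS does not carry over the bigserial default on id, so the sequence would need to be re-attached if new rows are still being inserted):

CREATE TABLE myTable_new AS
SELECT id, title, description, link, "type",
       0::real AS generalFreq,   -- the column being reset
       generalWeight, author_id, status_id
FROM myTable;

DROP TABLE myTable;              -- also drops the old indexes and constraints
ALTER TABLE myTable_new RENAME TO myTable;

-- Recreate the constraints and any other indexes:
ALTER TABLE myTable ADD CONSTRAINT resources_pkey PRIMARY KEY (id);
ALTER TABLE myTable ADD CONSTRAINT author_pkey FOREIGN KEY (author_id) REFERENCES users (id);
ALTER TABLE myTable ADD CONSTRAINT c_unique_status_id UNIQUE (status_id);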

被翻牌 2024-09-18 18:18:28

Today I spent many hours on a similar issue. I found a solution: drop all the constraints/indexes before the update. Whether or not the column being updated is indexed, PostgreSQL seems to update all the indexes for every updated row. After the update is finished, add the constraints/indexes back.
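
A minimal sketch of that pattern; the index name myTable_generalFreq_idx is hypothetical, and this is only safe when nothing else relies on the constraints/indexes during the window:

-- Drop the secondary structures first (names here are illustrative).
DROP INDEX IF EXISTS myTable_generalFreq_idx;
ALTER TABLE myTable DROP CONSTRAINT c_unique_status_id;

UPDATE myTable SET generalFreq = 0;

-- Recreate them afterwards.
ALTER TABLE myTable ADD CONSTRAINT c_unique_status_id UNIQUE (status_id);
CREATE INDEX myTable_generalFreq_idx ON myTable (generalFreq);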

白龙吟 2024-09-18 18:18:28

Try this (note that generalFreq starts as type REAL, and stays the same):

ALTER TABLE myTable ALTER COLUMN generalFreq TYPE REAL USING 0;

This will rewrite the table, similar to a DROP + CREATE, and rebuild all indexes, but all in one command. It is much faster (about 2x), and you don't have to deal with dependencies, recreating indexes, and other stuff, though it does lock the table (ACCESS EXCLUSIVE, i.e. a full lock) for the duration. Or maybe that is what you want, if you want everything else to queue up behind it. If you aren't updating "too many" rows, this way is slower than a plain UPDATE.

維他命╮ 2024-09-18 18:18:28

How are you running it? If you are looping over each row and issuing an UPDATE statement per row, you are potentially running millions of individual updates, which is why it performs so slowly.

If you run a single UPDATE statement covering all records in one statement, it will run a lot faster; if even that is slow, it is probably down to your hardware more than anything else. 3 million is a lot of records.
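
For illustration, this is the difference between the two approaches (the per-row version is simply what a client-side loop effectively sends to the server):

-- Per-row loop: millions of separate statements and round trips.
UPDATE myTable SET generalFreq = 0 WHERE id = 1;
UPDATE myTable SET generalFreq = 0 WHERE id = 2;
-- ... one statement per id ...

-- Set-based: one statement touches every row in a single pass.
UPDATE myTable SET generalFreq = 0;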
