What is the fastest way to bulk insert into Postgres?

爱你不解释 2024-07-24 04:03:38

PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
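
For reference, a minimal sketch of that approach; the table, column, index, and file names are placeholders, and COPY ... FROM a server-side file needs file-read privileges (psql's \copy is the client-side equivalent):

-- Drop secondary indexes, bulk load with COPY, then rebuild them.
-- tablename, fieldname1..3, the index name and the file path are all illustrative.
DROP INDEX IF EXISTS tablename_fieldname1_idx;

COPY tablename (fieldname1, fieldname2, fieldname3)
FROM '/tmp/pg_io/file.csv'
WITH (FORMAT csv, HEADER true);

CREATE INDEX tablename_fieldname1_idx ON tablename (fieldname1);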

要走干脆点 2024-07-24 04:03:38

There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

The above code inserts two rows, but you can extend it arbitrarily, until you hit the maximum number of prepared statement tokens (it might be $999, but I'm not 100% sure about that). Sometimes one cannot use COPY, and this is a worthy replacement for those situations.

墨小墨 2024-07-24 04:03:38

One way to speed things up is to explicitly perform multiple INSERTs or COPYs within a single transaction (say, 1000 at a time). Postgres's default behavior is to commit after each statement, so by batching the commits you can avoid some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom suggesting that increasing wal_buffers to 16 MB may also help.
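
A minimal sketch of that idea, with a placeholder table and values; wrapping the statements in one explicit transaction means a single commit for the whole batch:

BEGIN;
INSERT INTO tablename (fieldname1, fieldname2) VALUES (1, 'a');
INSERT INTO tablename (fieldname1, fieldname2) VALUES (2, 'b');
-- ... repeat for roughly 1000 rows per batch ...
COMMIT;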

金兰素衣 2024-07-24 04:03:38

The UNNEST function with arrays can be used along with the multirow VALUES syntax. I think this method is slower than using COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a Postgres ARRAY):

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
    UNNEST(ARRAY[1, 2, 3]), 
    UNNEST(ARRAY[100, 200, 300]), 
    UNNEST(ARRAY['a', 'b', 'c'])
);

Without VALUES, using a subselect with an additional existence check:

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
    SELECT UNNEST(ARRAY[1, 2, 3]), 
           UNNEST(ARRAY[100, 200, 300]), 
           UNNEST(ARRAY['a', 'b', 'c'])
) AS temptable
WHERE NOT EXISTS (
    SELECT 1 FROM tablename tt
    WHERE tt.fieldname1=temptable.fieldname1
);

The same syntax works for bulk updates:

UPDATE tablename
SET fieldname1=temptable.data
FROM (
    SELECT UNNEST(ARRAY[1,2]) AS id,
           UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
意犹 2024-07-24 04:03:38

((This is a WIKI: you can edit and enhance the answer!))

An external file is the best and most typical form of bulk data

The term "bulk data" is related to "a lot of data", so it is natural to use original raw data, with no need to transform it into SQL. Typical raw data files for "bulk insert" are CSV and JSON formats.

Bulk insert with some transformation

In ETL applications and ingestion processes, we need to change the data before inserting it. A temporary table consumes (a lot of) disk space, and it is not the fastest way to do it. The PostgreSQL foreign-data wrapper (FDW) is the best choice.

CSV example. Suppose a table tablename (x, y, z) in SQL and a CSV file like

fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...

You can use the classic SQL COPY to load the raw data as-is into tmp_tablename, then insert the filtered data into tablename... But, to avoid disk consumption, the best is to ingest directly with

INSERT INTO tablename (x, y, z)
  SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms 
  FROM tmp_tablename_fdw
  -- WHERE conditions
;

You need to prepare the database for FDW, and instead of a static tmp_tablename_fdw you can use a function that generates it:

CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
  ...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv');

JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json, can be ingested by:

INSERT INTO tablename (fname, metadata, content)
 SELECT fname, meta, j  -- do any data transformation here
 FROM jsonb_read_files('myRawData%.json')
 -- WHERE any_condition_here
;

where the function jsonb_read_files() reads all files in a folder that match a mask:

CREATE or replace FUNCTION jsonb_read_files(
  p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
  WITH t AS (
     SELECT (row_number() OVER ())::int id, 
           f AS fname,
           p_fpath ||'/'|| f AS f
     FROM pg_ls_dir(p_fpath) t(f)
     WHERE f LIKE p_flike
  ) SELECT id, fname,
         to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
         pg_read_file(f)::jsonb
    FROM t
$f$  LANGUAGE SQL IMMUTABLE;

Lack of gzip streaming

The most frequent method for "file ingestion" (mainly in Big Data) is to preserve the original file in gzip format and transfer it with a streaming algorithm, anything that runs fast and without disk consumption in Unix pipes:

 gunzip -c remote_or_local_file.csv.gz | convert_to_sql | psql

So the ideal (future) solution would be a server-side option for the .csv.gz format.

Note after @CharlieClark's comment: currently (2022) there is nothing built in, and the best alternative seems to be pgloader with STDIN:

  gunzip -c file.csv.gz | pgloader --type csv ... - pgsql:///target?foo
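
As a hedged aside: if the compressed file is reachable from the database server itself and the role has the required privilege (superuser, or membership in pg_execute_server_program), COPY ... FROM PROGRAM can stream the decompressed data without an intermediate file; the table name and path below are placeholders:

-- Server-side streaming decompression straight into the table.
COPY tablename (x, y, z)
FROM PROGRAM 'gunzip -c /tmp/pg_io/file.csv.gz'
WITH (FORMAT csv, HEADER true);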
手心的温暖 2024-07-24 04:03:38

You can use COPY table FROM ... WITH BINARY, which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert and you are comfortable with binary data.

Here is an example recipe in Python, using psycopg2 with binary input.
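
A minimal SQL sketch, assuming the file was produced earlier by COPY ... TO ... WITH (FORMAT binary) against a table with the same column layout; the table, column, and file names are placeholders:

COPY tablename (fieldname1, fieldname2, fieldname3)
FROM '/tmp/pg_io/data.bin'
WITH (FORMAT binary);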

时光清浅 2024-07-24 04:03:38

It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the data model and the presence of constraints, triggers, etc.

My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table.
Then I check what can be checked: duplicates, keys that already exist in the target, etc.

Then I just do an insert into target select * from tmp or similar.

If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc)
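
A minimal sketch of that staging workflow; the table names and the id key column are assumptions:

-- 1. Empty copy of the target's structure (no indexes or constraints).
CREATE TABLE tmp AS SELECT * FROM target WHERE 1 = 0;

-- 2. Read the file into the staging table.
COPY tmp FROM '/tmp/pg_io/file.csv' WITH (FORMAT csv, HEADER true);

-- 3. Check what can be checked, e.g. keys that already exist in the target.
SELECT t.* FROM tmp t JOIN target USING (id);

-- 4. Final load, then clean up.
INSERT INTO target SELECT * FROM tmp;
DROP TABLE tmp;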

顾冷 2024-07-24 04:03:38

I just encountered this issue and would recommend csvsql for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.

$ createdb test 
$ csvsql --db postgresql:///test --insert examples/*.csv
孤单情人 2024-07-24 04:03:38

Maybe I'm late already, but there is a Java library called pgbulkinsert by Bytefish. My team and I were able to bulk insert 1 million records in 15 seconds. Of course, we also performed some other operations, such as reading the 1M+ records from a file sitting on Minio, doing a couple of processing steps on top of them, and filtering out duplicate records, before finally inserting the 1M records into the Postgres database. And all of these steps completed within 15 seconds. I don't remember exactly how much time the DB operation itself took, but I think it was less than 5 seconds. Find more details at https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html

猥︴琐丶欲为 2024-07-24 04:03:38

As others have noted, when importing data into Postgres, things will be slowed by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it's suitable for use. Any of this that can be done outside of the Postgres process will mean that you can import using the COPY protocol.

For my use, I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL, you need to be able to handle some MySQL oddities, such as the use of \N for an empty value, along with encoding problems. The files are also so large that, at least on my machine, using FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast to the relevant data types, and do any additional work before the data goes into your main database, so that index updates, etc. are minimal.

风和你 2024-07-24 04:03:38

The query below creates a test table with a generate_series column that has 10000 rows. *I usually create such a test table to test query performance, and you can check generate_series():

CREATE TABLE test AS SELECT generate_series(1, 10000);
postgres=# SELECT count(*) FROM test;
 count
-------
 10000
(1 row)
postgres=# SELECT * FROM test;
 generate_series
-----------------
               1
               2
               3
               4
               5
               6
-- More --

And run the query below to insert 10000 rows if you already have the test table:

INSERT INTO test (generate_series) SELECT generate_series(1, 10000);