Postgres 性能技巧加载数十亿行
我正在进行一个项目,涉及尝试从 70GB 的 xml 文档中获取大量信息并将其加载到关系数据库(在本例中为 postgres)中,我目前正在使用 python 脚本和 psycopg2 来执行此插入等操作。我发现随着一些表中的行数增加。 (其中最大的行数约为 500 万行)脚本(插入)的速度已经慢得像爬行一样。以前需要几分钟的事情现在需要大约一个小时。
我可以做什么来加快速度?我使用 python 和 psycopg2 来完成这项任务是错误的吗?我可以对数据库做些什么来加快这个过程吗?我感觉我正在以完全错误的方式处理这件事。
I am in the middle of a project involving trying to grab numerous pieces of information out of 70GB worth of xml documents and loading it into a relational database (in this case postgres) I am currently using python scripts and psycopg2 to do this inserts and whatnot. I have found that as the number of rows in the some of the tables increase. (The largest of which is at around 5 million rows) The speed of the script (inserts) has slowed to a crawl. What was once taking a couple of minutes now takes about an hour.
What can I do to speed this up? Was I wrong in using python and psycopg2 for this task? Is there anything I can do to the database that may speed up this process. I get the feeling I am going about this in entirely the wrong way.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
考虑到这个过程之前相当有效,直到现在当数据集增大时,它才减慢了速度,我猜测是索引的原因。您可以尝试在导入之前删除表上的索引,并在导入完成后重新创建它们。这应该会加快速度。
Considering the process was fairly efficient before and only now when the dataset grew up it slowed down my guess is it's the indexes. You may try dropping indexes on the table before the import and recreating them after it's done. That should speed things up.
wal_buffers 和 checkpoint_segments 的设置是什么?对于大额交易,您必须调整一些设置。查看手册。
也可以考虑一下PostgreSQL 9.0 High Performance这本书,还有更多内容调整不仅仅是数据库配置以获得高性能。
What are the settings for wal_buffers and checkpoint_segments? For large transactions, you have to tweak some settings. Check the manual.
Consider the book PostgreSQL 9.0 High Performance as well, there is much more to tweak than just the database configuration to get high performance.
我尝试使用 COPY 而不是插入。这就是备份工具用于快速加载的方法。
检查该表中的所有外键是否在目标表上都有相应的索引。或者更好 - 在复制之前暂时删除它们,然后重新创建。
将 checkpoint_segments 从默认的 3(这意味着 3*16MB=48MB)增加到更高的数字 - 例如尝试 32 (512MB)。确保有足够的空间容纳这么多额外的数据。
如果您有能力在系统崩溃或电源故障时从头开始重新创建或恢复数据库集群,那么您可以使用“-F”选项启动 Postgres,这将启用操作系统写入缓存。
I'd try to use COPY instead of inserts. This is what backup tools use for fast loading.
Check if all foreign keys from this table do have corresponding index on target table. Or better - drop them temporarily before copying and recreate after.
Increase checkpoint_segments from default 3 (which means3*16MB=48MB) to a much higher number - try for example 32 (512MB). make sure you have enough space for this much additional data.
If you can afford to recreate or restore your database cluster from scratch in case of system crash or power failure then you can start Postgres with "-F" option, which will enable OS write cache.
看看 http://pgbulkload.projects.postgresql.org/
Take a look at http://pgbulkload.projects.postgresql.org/
填充数据库中有关于此主题的提示列表文档部分。您还可以使用调整您的 PostgreSQL 服务器中的提示来提高总体性能。
随着表大小的增加,检查外键的开销可能会增加,这会变得更糟,因为您一次加载一条记录。如果您要加载 70GB 的数据,则在加载期间删除外键,然后在导入时重建它们会快得多。如果您使用单个 INSERT 语句,则尤其如此。由于待处理触发队列的管理方式,切换到 COPY 也不能保证得到改进——第一个文档链接中讨论了这些问题。
从 psql 提示符中,您可以找到强制执行外键的约束的名称,然后使用该名称将其删除,如下所示:
当您完成加载时,您可以使用以下内容将其放回去:
一个有用的技巧来找出用于恢复的确切语法是在数据库上执行pg_dump --schema-only。其中的转储将向您展示如何重新创建您现在拥有的结构。
There is a list of hints on this topic in the Populating a Database section of the documentation. You might speed up general performance using the hints in Tuning Your PostgreSQL Server as well.
The overhead of checking foreign keys might be growing as the table size increases, which is made worse because you're loading a single record at a time. If you're loading 70GB worth of data, it will be far faster to drop foreign keys during the load, then rebuild them when it's imported. This is particularly true if you're using single INSERT statements. Switching to COPY instead is not a guaranteed improvement either, due to how the pending trigger queue is managed--the issues there are discussed in that first documentation link.
From the psql prompt, you can find the name of the constraint enforcing your foreign key and then drop it using that name like this:
When you're done with loading, you can put it back using something like:
One useful trick to find out the exact syntax to use for the restore is to do pg_dump --schema-only on your database. The dump from that will show you how to recreate the structure you have right now.
我会查看回滚日志。如果你在一笔交易中做到这一点,它们一定会变得相当大。
如果是这种情况,也许您可以尝试提交较小的事务批量大小。将其分成更小的记录块(1K、10K、100K 等),看看是否有帮助。
I'd look at the rollback logs. They've got to be getting pretty big if you're doing this in one transaction.
If that's the case, perhaps you can try committing a smaller transaction batch size. Chunk it into smaller blocks of records (1K, 10K, 100K, etc.) and see if that helps.
前 5 百万行没什么,插入的差异不应该改变,是 100k 还是 1 mil;
1-2 个索引不会减慢太多(如果填充因子设置为 70-90,考虑到每个主要导入是表的 1/10)。
python 与 PSYCOPG2 相当快。
一个小提示,您可以使用数据库扩展 XML2 来读取/处理数据
小示例
https://dba.stackexchange.com/questions/ 8172/sql-to-read-xml-from-file-into-postgresql-database
duffymo 是对的,尝试以 10000 条插入为单位提交(仅在最后或每次插入之后提交是相当昂贵的)
如果您进行大量删除和更新,autovacuum 可能会变得臃肿,您可以在某些表开始时临时关闭它。根据您的服务器可用资源设置work_mem和maintenance_work_mem ...
对于插入,增加 wal_buffers,(9.0 及更高版本默认设置自动 -1)如果您使用版本 8 postgresql,您应该手动增加它
cud 还关闭 fsync 并测试 wal_sync_method(请谨慎更改此设置,如果突然发生断电或硬件崩溃,可能会使数据库崩溃不安全)
尝试删除外键、禁用触发器或设置触发器不运行/跳过执行的条件;
使用准备好的语句进行插入,转换变量
您尝试将数据插入到未记录的表中以临时保存数据,
插入是否具有来自子查询、函数等的条件或值?
First 5 mil rows is nothing, difference in inserts should not change is it 100k or 1 mil;
1-2 indexes wont slow it down that much(if fill factor is set 70-90, considering each major import is 1/10 of table ).
python with PSYCOPG2 is quite fast.
a small tip, you cud use database extension XML2 to read/work with data
small example from
https://dba.stackexchange.com/questions/8172/sql-to-read-xml-from-file-into-postgresql-database
duffymo is right, try to commit in chunks of 10000 inserts (committing only at the end or after each insert is quite expensive)
autovacuum might be bloating if you do a lot of deletes and updates, you can turn it off temporary at the start for certain tables. set work_mem and maintenance_work_mem according to your servers available resources ...
for inserts, increase wal_buffers, (9.0 and higher its set auto by default -1) if u use version 8 postgresql, you should increase it manually
cud also turn fsync off and test wal_sync_method(be cautious changing this may make your database crash unsafe if sudden power-failures or hardware crash occurs)
try to drop foreign keys, disable triggers or set conditions for trigger not to run/skip execution;
use prepared statements for inserts, cast variables
you cud try to insert data into an unlogged table to temporary hold data
are inserts having where conditions or values from a sub-query, functions or such alike?