加快 ORM 的数据库插入速度
我有一个 Django 视图,它在循环中创建 500-5000 个新数据库插入。问题是,它真的很慢!我在 Postgres 8.3 上每分钟大约插入 100 次。我们过去在较小的硬件(较小的 EC2 实例)上使用 MySQL,并且从未遇到过此类速度问题。
细节: Ubuntu Server 9.04 上的 Postgres 8.3。 服务器是一个“大型”Amazon EC2,数据库位于 EBS (ext3) - 11GB/20GB。
这是我的 postgresql.conf 的一些内容——如果您需要更多,请告诉我
shared_buffers = 4000MB
effective_cache_size = 7128MB
我的 python:
for k in kw:
k = k.lower()
p = ProfileKeyword(profile=self)
logging.debug(k)
p.keyword, created = Keyword.objects.get_or_create(keyword=k, defaults={'keyword':k,})
if not created and ProfileKeyword.objects.filter(profile=self, keyword=p.keyword).count():
#checking created is just a small optimization to save some database hits on new keywords
pass #duplicate entry
else:
p.save()
顶部的一些输出:
top - 16:56:22 up 21 days, 20:55, 4 users, load average: 0.99, 1.01, 0.94
Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.8%us, 0.2%sy, 0.0%ni, 90.5%id, 0.7%wa, 0.0%hi, 0.0%si, 2.8%st
Mem: 15736360k total, 12527788k used, 3208572k free, 332188k buffers
Swap: 0k total, 0k used, 0k free, 11322048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14767 postgres 25 0 4164m 117m 114m S 22 0.8 2:52.00 postgres
1 root 20 0 4024 700 592 S 0 0.0 0:01.09 init
2 root RT 0 0 0 0 S 0 0.0 0:11.76 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
6 root 11 -5 0 0 0 S 0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
9 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenbus
18 root RT -5 0 0 0 S 0 0.0 0:11.84 migration/1
19 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1
如果有任何其他详细信息会有帮助,请告诉我。
I have a Django view which creates 500-5000 new database INSERTS in a loop. Problem is, it is really slow! I'm getting about 100 inserts per minute on Postgres 8.3. We used to use MySQL on lesser hardware (smaller EC2 instance) and never had these types of speed issues.
Details:
Postgres 8.3 on Ubuntu Server 9.04.
Server is a "large" Amazon EC2 with database on EBS (ext3) - 11GB/20GB.
Here is some of my postgresql.conf -- let me know if you need more
shared_buffers = 4000MB
effective_cache_size = 7128MB
My python:
for k in kw:
k = k.lower()
p = ProfileKeyword(profile=self)
logging.debug(k)
p.keyword, created = Keyword.objects.get_or_create(keyword=k, defaults={'keyword':k,})
if not created and ProfileKeyword.objects.filter(profile=self, keyword=p.keyword).count():
#checking created is just a small optimization to save some database hits on new keywords
pass #duplicate entry
else:
p.save()
Some output from top:
top - 16:56:22 up 21 days, 20:55, 4 users, load average: 0.99, 1.01, 0.94
Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.8%us, 0.2%sy, 0.0%ni, 90.5%id, 0.7%wa, 0.0%hi, 0.0%si, 2.8%st
Mem: 15736360k total, 12527788k used, 3208572k free, 332188k buffers
Swap: 0k total, 0k used, 0k free, 11322048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14767 postgres 25 0 4164m 117m 114m S 22 0.8 2:52.00 postgres
1 root 20 0 4024 700 592 S 0 0.0 0:01.09 init
2 root RT 0 0 0 0 S 0 0.0 0:11.76 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
6 root 11 -5 0 0 0 S 0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
9 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenbus
18 root RT -5 0 0 0 S 0 0.0 0:11.84 migration/1
19 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1
Let me know if any other details would be helpful.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
像这样的批量操作缓慢的一个常见原因是每次插入都发生在自己的事务中。如果您可以在一次事务中实现所有这些操作,那么速度可能会快得多。
One common reason for slow bulk operations like this is each insert happening in its own transaction. If you can get all of them to happen in a single transaction, it could go much faster.
首先,ORM 操作总是比纯 SQL 慢。我曾经用 ORM 代码编写了一个大型数据库的更新并将其设置为运行,但几个小时后就退出了,因为它只完成了一小部分。用 SQL 重写后,整个过程不到一分钟就完成了。
其次,请记住,这里的代码为数据集中的每一行执行最多四个单独的数据库操作 - get_or_create 中的
get
,也可能是create
、对过滤器进行计数
,最后保存
。这是大量的数据库访问。请记住,最多 5000 个对象并不是很大,您应该能够在开始时将整个数据集读入内存。然后,您可以执行一个
filter
一次性获取所有现有的 Keyword 对象,从而节省了 Keywordget_or_create
中的大量查询,并且还避免了实例化重复项的需要个人资料关键词放在首位。Firstly, ORM operations are always going to be slower than pure SQL. I once wrote an update to a large database in ORM code and set it running, but quit it after several hours when it had completed only a tiny fraction. After rewriting it in SQL the whole thing ran in less than a minute.
Secondly, bear in mind that your code here is doing up to four separate database operations for every row in your data set - the
get
in get_or_create, possibly also thecreate
, thecount
on the filter, and finally thesave
. That's a lot of database access.Bearing in mind that a maximum of 5000 objects is not huge, you should be able to read the whole dataset into memory at the start. Then you can do a single
filter
to get all the existing Keyword objects in one go, saving a huge number of queries in the Keywordget_or_create
and also avoiding the need to instantiate duplicate ProfileKeywords in the first place.