Bringing Python's Postgres psycopg2 query performance up to the level of Java's JDBC driver



Overview

I'm attempting to improve the performance of our database queries under SQLAlchemy. We're using psycopg2. In our production system, we're choosing to go with Java because it is simply faster by at least 50%, if not closer to 100%. So I am hoping someone in the Stack Overflow community has a way to improve my performance.

I think my next step will end up being to patch the psycopg2 library to behave like the JDBC driver. If that's the case and someone has already done this, that would be fine, but I'm hoping there is still a settings or refactoring tweak I can make from Python.

Details

I have a simple "SELECT * FROM someLargeDataSetTable" query running. The dataset is GBs in size. A quick performance chart is as follows:

Timing Table

        Records    | JDBC  | SQLAlchemy[1] |  SQLAlchemy[2] |  Psql
-------------------------------------------------------------------- 
         1 (4kB)   | 200ms |         300ms |          250ms |   10ms
        10 (8kB)   | 200ms |         300ms |          250ms |   10ms
       100 (88kB)  | 200ms |         300ms |          250ms |   10ms
     1,000 (600kB) | 300ms |         300ms |          370ms |  100ms
    10,000 (6MB)   | 800ms |         830ms |          730ms |  850ms  
   100,000 (50MB)  |    4s |            5s |           4.6s |     8s
 1,000,000 (510MB) |   30s |           50s |            50s |  1m32s  
10,000,000 (5.1GB) | 4m44s |         7m55s |          6m39s |    n/a
-------------------------------------------------------------------- 
 5,000,000 (2.6GB) | 2m30s |         4m45s |          3m52s | 14m22s
-------------------------------------------------------------------- 
[1] - With the processrow function
[2] - Without the processrow function (direct dump)

I could add more (our data can be as much as terabytes), but I think the change in slope is evident from the data. JDBC just performs significantly better as the dataset size increases. Some notes...

Timing Table Notes:

  • The data sizes are approximate, but they should give you an idea of the amount of data.
  • I'm using the 'time' tool from a Linux bash command line.
  • The times are wall-clock times (i.e. real).
  • I'm using Python 2.6.6 and I'm running with python -u.
  • The fetch size is 10,000.
  • I'm not really worried about the Psql timing; it's there just as a reference point. I may not have set its fetch size properly.
  • I'm also not really worried about the timings below the fetch size, since anything under 5 seconds is negligible for my application.
  • Java and Psql appear to take about 1GB of memory; Python is more like 100MB (yay!!).
  • I'm using the [cdecimals] library.
  • I noticed a [recent article] discussing something similar to this. It appears that the JDBC driver design is totally different from the psycopg2 design (which I find rather annoying given the performance difference).
  • My use-case is basically that I have to run a daily process (approximately 20,000 different steps... multiple queries) over very large datasets, and I have a very specific window of time in which to finish this process. The Java we use is not simply JDBC; it's a "smart" wrapper on top of the JDBC engine... we don't want to use Java, and we'd like to stop using the "smart" part of it.
  • I'm using one of our production system's boxes (database and backend process) to run the query, so this is our best-case timing. We have QA and Dev boxes that run much slower, and the extra query time can become significant.

testSqlAlchemy.py

#!/usr/bin/env python
# testSqlAlchemy.py
import sys
try:
    import cdecimal
    sys.modules["decimal"] = cdecimal
except ImportError:
    print >> sys.stderr, "Error: cdecimal didn't load properly."
    raise SystemExit
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def processrow(row, delimiter="|", null="\N"):
    # Render one result row as a delimited string, mapping None to the null marker.
    newrow = []
    for x in row:
        if x is None:
            x = null
        newrow.append(str(x))
    return delimiter.join(newrow)

fetchsize = 10000
connectionString = "postgresql+psycopg2://usr:pass@server:port/db"
eng = create_engine(connectionString, server_side_cursors=True)
session = sessionmaker(bind=eng)()

with open("test.sql", "r") as queryFD:
    with open("/dev/null", "w") as nullDev:
        query = session.execute(queryFD.read())
        cur = query.cursor
        # Keep draining the server-side cursor until it reports no more rows.
        while cur.statusmessage not in ['FETCH 0', 'CLOSE CURSOR']:
            for row in query.fetchmany(fetchsize):
                print >> nullDev, processrow(row)

After timing, I also ran cProfile; this is a dump of the worst offenders:

Timing Profile (with processrow)

Fri Mar  4 13:49:45 2011    sqlAlchemy.prof

         415757706 function calls (415756424 primitive calls) in 563.923 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  563.924  563.924 {execfile}
        1   25.151   25.151  563.924  563.924 testSqlAlchemy.py:2()
     1001    0.050    0.000  329.285    0.329 base.py:2679(fetchmany)
     1001    5.503    0.005  314.665    0.314 base.py:2804(_fetchmany_impl)
 10000003    4.328    0.000  307.843    0.000 base.py:2795(_fetchone_impl)
    10011    0.309    0.000  302.743    0.030 base.py:2790(__buffer_rows)
    10011  233.620    0.023  302.425    0.030 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000  145.459    0.000  209.147    0.000 testSqlAlchemy.py:13(processrow)

Timing Profile (without processrow)

Fri Mar  4 14:03:06 2011    sqlAlchemy.prof

         305460312 function calls (305459030 primitive calls) in 536.368 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  536.370  536.370 {execfile}
        1   29.503   29.503  536.369  536.369 testSqlAlchemy.py:2()
     1001    0.066    0.000  333.806    0.333 base.py:2679(fetchmany)
     1001    5.444    0.005  318.462    0.318 base.py:2804(_fetchmany_impl)
 10000003    4.389    0.000  311.647    0.000 base.py:2795(_fetchone_impl)
    10011    0.339    0.000  306.452    0.031 base.py:2790(__buffer_rows)
    10011  235.664    0.024  306.102    0.031 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
 10000000   32.904    0.000  172.802    0.000 base.py:2246(__repr__)

Final Comments

Unfortunately, the processrow function needs to stay unless there is a way within SQLAlchemy to specify null = 'userDefinedValueOrString' and delimiter = 'userDefinedValueOrString' for the output. The Java we are currently using already does this, so the comparison (with processrow) needed to be apples to apples. If there is a way to improve the performance of either processrow or SQLAlchemy with pure Python or a settings tweak, I'm very interested.
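
For what it's worth, the kind of refactoring I'd be willing to try is sketched below: it drops SQLAlchemy and reads from a raw psycopg2 named (server-side) cursor, since the profiles above put most of the time in SQLAlchemy's fetch wrappers around psycopg2's fetchmany. This is only a sketch, not what we run today; the cursor name, itersize value, and connection parameters are illustrative placeholders.

#!/usr/bin/env python
# testRawPsycopg2.py -- sketch: the same dump loop, but on a raw psycopg2 named cursor.
import psycopg2

conn = psycopg2.connect(host="server", port=5432, user="usr", password="pass", database="db")
cur = conn.cursor(name="bulk_dump")   # a named cursor is a server-side cursor
cur.itersize = 10000                  # rows pulled per network round trip while iterating

with open("test.sql", "r") as queryFD:
    cur.execute(queryFD.read())       # assumes test.sql holds a single SELECT

with open("/dev/null", "w") as nullDev:
    for row in cur:
        # Same formatting as processrow: None -> \N, fields joined with '|'.
        nullDev.write("|".join("\N" if x is None else str(x) for x in row) + "\n")

cur.close()
conn.close()

Whether this actually closes the gap with the JDBC driver is exactly what I'd like to know.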


Comments (4)

空城之時有危險 2024-10-27 03:07:11


This is not an out-of-the-box answer; with all client/DB stuff you may need to do some work to determine exactly what is amiss.

Back up postgresql.conf, then change:

log_min_duration_statement = 0
log_destination = 'csvlog'                        # log in CSV format
logging_collector = on                            # capture stderr and csvlog output
log_directory = 'pg_log'                          # directory where log files are written
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'   # log file name pattern
debug_print_parse = on
debug_print_rewritten = on
debug_print_plan = on
log_min_messages = info                           # debug1 for server versions prior to 8.4

Stop and restart your database server (a reload may not pick up the changes).
Reproduce your tests, ensuring that the server and client times match, and record the start times, etc.

Copy the log file off and import it into an editor of your choice (Excel or another spreadsheet can be useful for more advanced manipulation of the SQL, plans, etc.).

Now examine the timings from the server side and note:

  • Is the SQL reported on the server the same in each case?

  • If it is the same, you should see the same timings.

  • Is the client generating a cursor rather than passing the SQL?

  • Is one driver doing a lot of casting/converting between character sets, or implicit conversion of other types such as dates or timestamps?

and so on

The plan data is included for completeness; it may show whether there are gross differences in the SQL submitted by the clients.
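
To make the comparison concrete, here is a rough sketch (assuming log_min_duration_statement = 0 as above, so every statement logs a "duration: N.NNN ms" line) of pulling the server-side durations back out of the csvlog; it scans every csvlog column rather than relying on a particular column layout.

# pg_csvlog_durations.py -- sketch: total up per-statement durations from a csvlog file.
import csv
import re
import sys

duration_re = re.compile(r"duration: ([0-9.]+) ms")

total_ms = 0.0
count = 0
with open(sys.argv[1], "rb") as logfile:
    for record in csv.reader(logfile):
        # The message text is one of the csvlog columns; scan them all so this
        # works regardless of the exact column layout for this server version.
        for field in record:
            m = duration_re.search(field)
            if m:
                total_ms += float(m.group(1))
                count += 1
                break

print "%d statements, %.1f ms total on the server side" % (count, total_ms)

Comparing that server-side total against the client wall-clock time should show how much of the difference is in the drivers rather than in the database itself.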

装迷糊 2024-10-27 03:07:11


The stuff below is probably aiming above and beyond what you have in mind or what is deemed acceptable in your environment, but I'll put the option on the table just in case.

  1. Is the destination of every SELECT in your test.sql truly a simple |-separated results file?
  2. Is non-portability (Postgres-specificity) acceptable?
  3. Is your backend Postgres 8.2 or newer?
  4. Will the script run on the same host as the database backend, or would it be acceptable to generate the |-separated results file(s) from within the backend (e.g. to a share)?

If the answer to all of the above questions is yes, then you can transform your SELECT ... statements to COPY ( SELECT ... ) TO E'path-to-results-file' WITH DELIMITER '|' NULL E'\\N'.
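
One way to keep the script on the client while still letting the server do the '|' / \N formatting is psycopg2's copy_expert, which streams COPY output to a client-side file. A rough sketch, assuming test.sql holds a single SELECT; the file names and connection parameters are placeholders:

# copy_dump.py -- sketch: stream COPY output to a client-side file with psycopg2.
import psycopg2

conn = psycopg2.connect(host="server", port=5432, user="usr", password="pass", database="db")
cur = conn.cursor()

with open("test.sql", "r") as queryFD:
    select_sql = queryFD.read().strip().rstrip(";")

# Produces: COPY (SELECT ...) TO STDOUT WITH DELIMITER '|' NULL E'\\N'
copy_sql = "COPY (%s) TO STDOUT WITH DELIMITER '|' NULL E'\\\\N'" % select_sql

with open("results.txt", "w") as outFD:
    cur.copy_expert(copy_sql, outFD)   # the server formats the rows; the client just writes bytes

cur.close()
conn.close()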

装纯掩盖桑 2024-10-27 03:07:11


An alternative could be to use ODBC. This assumes that the Python ODBC driver performs well.

PostgreSQL has ODBC drivers for both Windows and Linux.
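
For illustration only (pyodbc is one such Python ODBC driver, and the DSN and credentials below are assumptions), the client side would look roughly like this:

# odbc_dump.py -- sketch: the same fetch loop over ODBC, using pyodbc as one possible driver.
import pyodbc

conn = pyodbc.connect("DSN=myPostgresDSN;UID=usr;PWD=pass")
cur = conn.cursor()
cur.execute("SELECT * FROM someLargeDataSetTable")

while True:
    rows = cur.fetchmany(10000)
    if not rows:
        break
    for row in rows:
        pass  # format/write each row here, e.g. with the processrow logic from the question

conn.close()

Whether this beats psycopg2 would have to be measured; the row-conversion overhead simply moves into a different driver.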

扛起拖把扫天下 2024-10-27 03:07:11


As someone who programmed mostly in assembler, there is one thing that sticks out as obvious. You are losing time in the overhead, and the overhead is what needs to go.

Rather than using Python, which wraps itself in something else that integrates with a C wrapper around the DB... just write the code in C. I mean, how long can it take? Postgres is not hard to interface with (quite the opposite). C is an easy language. The operations you are performing seem pretty straightforward. You can also use SQL embedded in C; it's just a matter of a pre-compile. No need to translate what you were thinking: just write it there along with the C and use the supplied ECPG compiler (read the Postgres manual, chapter 29 IIRC).

Take out as much of the in-betweeny interfacey stuff as you can, cut out the middle man and get talking to the database natively. It seems to me that in trying to make the system simpler you are actually making it more complicated than it needs to be. When things are getting really messy, I usually ask myself the question "What bit of the code am I most afraid to touch?" - that usually points me to what needs changing.

Sorry for babbling on, but maybe a step backward and some fresh air will help ;)
