CLUSTER does not shrink the table size when other transactions are run via Psycopg2

We are running a Python process that runs a stored procedure which imports files from a certain directory into the Postgres database. The files are first imported into an in-memory table and then into the disk table. The actual size of the in-memory table should never really grow beyond 30 MB. Because this table is constantly updated, its size grows (due to dead tuples). To keep things in check, we need to perform a CLUSTER operation on the table. I am using the psycopg2 module to run the stored procedure and to CLUSTER the table, but if the import process is running, the size of the table never goes down. If I stop the import process and run CLUSTER, the table size does go down. For performance reasons, I should be able to run the CLUSTER command without stopping the import process.

I have tried manual commits and ISOLATION_LEVEL_AUTOCOMMIT, but none of this has worked.
Below is sample code for the process -

import os
import psycopg2

while True:
    # Get the filenames in the directory.
    for filepath in filenames:
        conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
        cursor = conn.cursor()
        # Calls a PostgreSQL function that reads a file and imports it into
        # a table via INSERT statements and DELETEs any records that have the
        # same unique key as any of the records in the file.
        cursor.execute("SELECT import(%s, %s);", (filepath, str(db_timestamp)))
        conn.commit()
        cursor.close()
        conn.close()
        os.remove(get_media_path(filepath))

Using a similar conn object, I want to run the CLUSTER command once an hour -

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()
cursor.execute("CLUSTER table_name")
conn.commit()
cursor.close()
conn.close()

Also, I tried setting -

from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
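
For what it's worth, a minimal sketch of how the hourly maintenance run could combine autocommit with a size check, so the effect of CLUSTER can be observed directly; the cluster_and_report helper and the hard-coded connection string are illustrative, not part of the original code -

import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

def cluster_and_report(table_name):
    # Dedicated connection used only for maintenance work.
    conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
    conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = conn.cursor()

    # pg_total_relation_size() reports table + index + TOAST size in bytes.
    cursor.execute("SELECT pg_total_relation_size(%s::regclass);", (table_name,))
    size_before = cursor.fetchone()[0]

    # CLUSTER takes an ACCESS EXCLUSIVE lock on the table for its duration.
    # The table name cannot be passed as a bound parameter, only values can.
    cursor.execute("CLUSTER " + table_name)

    cursor.execute("SELECT pg_total_relation_size(%s::regclass);", (table_name,))
    size_after = cursor.fetchone()[0]

    cursor.close()
    conn.close()
    print("size before: %d bytes, after: %d bytes" % (size_before, size_after))

If the reported size stays the same while the import loop is running, that suggests a concurrent transaction can still see the dead rows, so CLUSTER is not allowed to discard them.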

Another piece of information -
All of this runs inside a Django environment. I could not use Django connection objects to do the task because Django could not release the connections with my threading code, and soon the database stopped accepting connections. Might this mixed environment have an effect on psycopg?

A few observations -

  1. Running the CLUSTER command while the import process is running - the size doesn't go down
  2. When I stop the import process and then run CLUSTER - the size does go down
  3. When I stop the import process and start it back up, and after that run the CLUSTER command - the size does go down

Any thoughts on the problem?

2 Answers

荭秂 2024-10-18 13:34:32

From the manual:

When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.

Are you sure you have to CLUSTER every hour? With a better fillfactor and autovacuum, your table won't grow that much and you won't have dead tuples in the table.
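
A rough sketch of what that tuning could look like from psycopg2, assuming the table is named table_name and a PostgreSQL version (8.4 or later) that accepts per-table autovacuum storage parameters; the specific values are illustrative only -

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()

# Leave some free space in every page so updates can reuse it in place
# instead of piling up dead tuples; this only affects pages written after
# the change (existing pages keep their layout until rewritten).
cursor.execute("ALTER TABLE table_name SET (fillfactor = 70)")

# Ask autovacuum to visit this table more aggressively than the default,
# so dead tuples are reclaimed between imports.
cursor.execute("""
    ALTER TABLE table_name SET (
        autovacuum_vacuum_scale_factor = 0.05,
        autovacuum_vacuum_threshold = 50
    )
""")

conn.commit()
cursor.close()
conn.close()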

ι不睡觉的鱼゛ 2024-10-18 13:34:32

OK - I found the culprit.

The problem was that somehow the cluster or vacuum was not deleting the dead tuples, because of some weird interaction that happened when we used psycopg2 directly inside the Django environment. After isolating the psycopg code and removing the Django-related code from the import process, everything worked fine. This solved the problem, and now I can vacuum or cluster the table without stopping the import process.
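
The usual mechanism behind this symptom is a connection left sitting idle in an open transaction (here, the Django-managed connections), which keeps PostgreSQL from reclaiming dead tuples even when CLUSTER or VACUUM runs. A quick diagnostic sketch, assuming PostgreSQL 9.2 or later where pg_stat_activity exposes a state column -

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()

# Sessions sitting "idle in transaction" pin old row versions and prevent
# CLUSTER/VACUUM from removing dead tuples newer than their snapshots.
cursor.execute("""
    SELECT pid, state, xact_start, query
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_start
""")
for pid, state, xact_start, query in cursor.fetchall():
    print(pid, state, xact_start, query)

cursor.close()
conn.close()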
