PostgreSQL CLUSTER command not clearing dead tuples
We have a background process (a Linux daemon in an infinite loop) that automatically takes all lines from CSV files that are placed in a certain directory and imports them into a table. The daemon processes any files that appear in the directory one by one, is written in Python, and uses psycopg2 to connect to our PostgreSQL database.
That process imports those records using INSERT statements, but first DELETES any table records that have the same unique key as any of the records in the CSV file. Generally the process is DELETING a record for every record it INSERTS. So as this daemon is running in the background it is DELETING and then INSERTING rows. Every time it processes one file it specifically commits the transaction, closes the cursor, and then closes the connection.
Periodically (twice a day) we want to run CLUSTER to remove the dead tuples and keep the table at a manageable on-disk size.
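A minimal sketch of such a twice-daily CLUSTER call with psycopg2 follows; imported_data and imported_data_unique_key_idx are placeholders, since the real table and index names are not shown in this question:

import psycopg2

# Minimal sketch of the periodic CLUSTER call; the table and index names
# are placeholders. CLUSTER takes an ACCESS EXCLUSIVE lock and rewrites
# the whole table, so the importer is blocked while it runs.
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()
cursor.execute("CLUSTER imported_data USING imported_data_unique_key_idx;")
conn.commit()
cursor.close()
conn.close()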
However, something in this process is stopping the CLUSTER command from removing the dead tuples for all the records that are being deleted while the process is running. We know this is happening because if we run CLUSTER while the process is running, the on-disk size of the table containing this imported data will not decrease and pg_stat_user_tables will show many dead tuples.
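The dead-tuple numbers come from a query along these lines (a minimal sketch; imported_data stands in for the real table name):

import psycopg2

# Minimal sketch of the dead-tuple check; 'imported_data' is a placeholder
# for the real table name.
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()
cursor.execute(
    "SELECT n_live_tup, n_dead_tup, last_autovacuum "
    "FROM pg_stat_user_tables WHERE relname = %s;",
    ('imported_data',))
print(cursor.fetchone())  # (live tuples, dead tuples, last autovacuum time)
cursor.close()
conn.close()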
If we stop the process and then run CLUSTER, the on-disk size of the table will decrease dramatically and pg_stat_user_tables will report that all of the dead tuples are gone.
What's strange is we are committing the transaction and closing the connections every time we process each file, so I have no idea what is not allowing the dead tuples to be removed while the process is running.
Also strange is that if we stop the process, start it again, and then run CLUSTER, it will remove all of the dead tuples created by the previous run of the daemon; but any subsequent CLUSTER calls will not clear the dead tuples created by the current run of the daemon (while it is still running, of course).
So something is maintaining some kind of link to the dead tuples until the process is stopped, even though we have committed the transaction and closed all connections to Postgres that created those dead tuples. pg_locks does not report any open locks and no running transactions are reported, so it doesn't seem like it's a lock or open-transaction issue.
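The checks for open locks and running transactions look roughly like this (a sketch; the pg_stat_activity column names are the ones used in PostgreSQL 9.2 and later):

import psycopg2

# Sketch of the lock / open-transaction check. Any backend that is
# "idle in transaction" (or has a long-running transaction) holds back
# dead-tuple cleanup, so that is what we look for here.
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()

cursor.execute(
    "SELECT pid, state, xact_start, query "
    "FROM pg_stat_activity WHERE state <> 'idle';")
for row in cursor.fetchall():
    print(row)

# Granted locks on our tables would show up here.
cursor.execute(
    "SELECT locktype, relation::regclass, mode, granted FROM pg_locks;")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()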
At the end of the day, this is stopping us from periodically running CLUSTER on the table so that it doesn't keep growing and growing.
I'm sure there is a simple answer to this, but I can't find it anywhere. Some skeleton code for the process is below. It really is a simple process so I have no idea what is going on here. Any guidance would be greatly appreciated.
import os
import time

import psycopg2

# get_modified_time() and get_full_pathname() are the daemon's own helpers
# (not shown here).
while True:
    l = [(get_modified_time(fname), fname) for fname in os.listdir('/tmp/data')]
    l.sort()
    for (t, fname) in l:
        conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
        cursor = conn.cursor()
        # Calls a postgresql function that reads a file and imports it into
        # a table via INSERT statements and DELETEs any records that have the
        # same unique key as any of the records in the file.
        cursor.execute("SELECT import('%s', '%s');" % (fname, t))
        conn.commit()
        cursor.close()
        conn.close()
        os.remove(get_full_pathname(fname))
    time.sleep(0.100)
Comments (1)
What's wrong with autovacuum? When autovacuum does its job, you don't have to use CLUSTER to clean up dead tuples. CLUSTER isn't made for this; VACUUM is.
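If autovacuum isn't keeping up on this one table, it can be made more aggressive per table, and a manual VACUUM can be run in between; a sketch with psycopg2 (imported_data and the threshold values are only placeholders):

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
# VACUUM cannot run inside a transaction block, so use autocommit.
conn.autocommit = True
cursor = conn.cursor()

# Let autovacuum trigger sooner on this table (placeholder name and values).
cursor.execute(
    "ALTER TABLE imported_data SET (autovacuum_vacuum_scale_factor = 0.05, "
    "autovacuum_vacuum_threshold = 1000);")

# A plain VACUUM marks dead tuples as reusable space; it does not shrink
# the file the way CLUSTER does, but it stops the table from growing.
cursor.execute("VACUUM ANALYZE imported_data;")

cursor.close()
conn.close()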
If you change the process to UPDATE duplicates, things might get even better when you use a lower FILLFACTOR: HOT updates. These are faster, reclaim space, keep the same order in storage, and need neither VACUUM nor CLUSTER.
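A sketch of what that change could look like, assuming PostgreSQL 9.5+ for INSERT ... ON CONFLICT (on older versions an UPDATE followed by an INSERT of the missing rows does the same job); the table, key, and payload column names are placeholders:

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()

# Leave ~30% free space on each page so updated row versions can stay on
# the same page, which is what makes HOT updates possible. Existing pages
# only pick up the new fillfactor after the table is rewritten.
cursor.execute("ALTER TABLE imported_data SET (fillfactor = 70);")

# Upsert instead of DELETE + INSERT. HOT only applies when no indexed
# column changes, so only the non-indexed payload column is updated here.
cursor.execute(
    "INSERT INTO imported_data (unique_key, payload) VALUES (%s, %s) "
    "ON CONFLICT (unique_key) DO UPDATE SET payload = EXCLUDED.payload;",
    ('some-key', 'some-value'))

conn.commit()
cursor.close()
conn.close()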