How do I delete duplicate entries?

Posted 2024-08-11 07:51:36, 150 words, 4 views, 0 comments

I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.

What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?

Comments (16)

清浅ˋ旧时光 2024-08-18 07:51:36

Some of these approaches seem a little complicated, and I generally do this as:

Given a table named table, to make it unique on (field1, field2) while keeping the row with the max field3:

DELETE FROM table USING table alias
  WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND
    table.field3 < alias.field3

For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).

DELETE FROM user_accounts USING user_accounts ua2
  WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;
  • Note - USING is not standard SQL, it is a PostgreSQL extension (but a very useful one), but the original question specifically mentions PostgreSQL.
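
A minimal end-to-end sketch of the same pattern, assuming a throwaway table (`user_accounts_demo`, its rows, and the constraint name are made up for illustration): delete the older duplicates, then add the constraint the question is after.

```sql
-- Hypothetical demo table; names and data are illustrative only.
CREATE TEMP TABLE user_accounts_demo (id serial PRIMARY KEY, email text);
INSERT INTO user_accounts_demo (email) VALUES
  ('a@example.com'), ('b@example.com'), ('a@example.com'), ('a@example.com');

-- Delete every row for which a newer row (larger id) with the same email exists.
DELETE FROM user_accounts_demo ua1
USING  user_accounts_demo ua2
WHERE  ua1.email = ua2.email
AND    ua1.id < ua2.id;

-- With the duplicates gone, the unique constraint can be added.
ALTER TABLE user_accounts_demo
  ADD CONSTRAINT user_accounts_demo_email_key UNIQUE (email);

SELECT id, email FROM user_accounts_demo ORDER BY id;
```

Rows with a NULL email are left alone by this DELETE, since `=` never matches NULL, and a plain UNIQUE constraint accepts them anyway.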
牵强ㄟ 2024-08-18 07:51:36

For example you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
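
One possible way to spell that out, as a sketch only: the table `t_demo` and its columns are invented here, and `LIKE ... INCLUDING ALL` is my assumption for how the new table gets its definition (the original leaves the `CREATE TABLE` details open). Note that `SELECT DISTINCT *` only folds rows that are identical in every column.

```sql
-- Hypothetical source table with one fully duplicated row.
CREATE TABLE t_demo (a int, b text);
INSERT INTO t_demo VALUES (1, 'x'), (1, 'x'), (2, 'y');

BEGIN;
-- Copy column definitions, defaults, constraints and indexes from the old table.
CREATE TABLE t_demo_new (LIKE t_demo INCLUDING ALL);
INSERT INTO t_demo_new SELECT DISTINCT * FROM t_demo;
DROP TABLE t_demo;
ALTER TABLE t_demo_new RENAME TO t_demo;
COMMIT;
```

Wrapping the swap in one transaction means other sessions never see a moment where the table is missing. Views or foreign keys that referenced the old table are not carried over by this approach.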
游魂 2024-08-18 07:51:36

Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction.

This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain DELETE.

You mentioned millions of rows. To make the operation fast you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:

SELECT pg_size_pretty(pg_relation_size('tbl'));

Set temp_buffers at least a bit above that.

SET temp_buffers = '200MB';   -- example value

BEGIN;

CREATE TEMP TABLE t_tmp AS  -- retains temp for duration of session
SELECT DISTINCT * FROM tbl  -- DISTINCT folds duplicates
ORDER  BY id;               -- optionally "cluster" data

TRUNCATE tbl;

INSERT INTO tbl
SELECT * FROM t_tmp;        -- retains order (implementation detail)

COMMIT;

This method can be superior to creating a new table if dependent objects exist: views, indexes, foreign keys, or other objects referencing the table. TRUNCATE makes you begin with a clean slate anyway (new file in the background) and is much faster than DELETE FROM tbl with big tables (DELETE can actually be faster with small tables).

For big tables, it is often faster to drop indexes and foreign keys (FK), refill the table, and recreate these objects. As far as FK constraints are concerned, you have to be certain the new data is valid, of course, or you'll run into exceptions when trying to create the FK.

Note that TRUNCATE requires more aggressive locking than DELETE. This may be an issue for tables with heavy, concurrent load. But it's still less disruptive than to drop and replace the table completely.

If TRUNCATE is not an option or generally for small to medium tables there is a similar technique with a data-modifying CTE (Postgres 9.1+):

WITH del AS (DELETE FROM tbl RETURNING *)
INSERT INTO tbl
SELECT DISTINCT * FROM del
ORDER  BY id; -- optionally "cluster" data while being at it.

Slower for big tables, because TRUNCATE is faster there. But may be faster (and simpler!) for small tables.

If you have no depending objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.

For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible troubles / overhead with depending objects.

停顿的约定 2024-08-18 07:51:36

You can use oid or ctid, which are normally "non-visible" columns in the table:

DELETE FROM table
 WHERE ctid NOT IN
  (SELECT MAX(s.ctid)
    FROM table s
    GROUP BY s.column_has_be_distinct);
绝情姑娘 2024-08-18 07:51:36

PostgreSQL window functions are handy for this problem.

DELETE FROM tablename
WHERE id IN (SELECT id
              FROM (SELECT id,
                             row_number() over (partition BY column1, column2, column3 ORDER BY id) AS rnum
                     FROM tablename) t
              WHERE t.rnum > 1);

See Deleting duplicates.
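
Before running a DELETE like this on millions of rows, it can help to preview what it would remove. A small sketch, assuming a made-up table `tablename_demo` with two key columns instead of three:

```sql
-- Hypothetical demo table: (column1, column2) is the intended unique key.
CREATE TEMP TABLE tablename_demo (id int PRIMARY KEY, column1 text, column2 text);
INSERT INTO tablename_demo VALUES
  (1, 'a', 'x'), (2, 'a', 'x'), (3, 'b', 'y'), (4, 'a', 'x');

-- Preview: ids the DELETE would remove (everything but the lowest id per group).
SELECT id
FROM  (SELECT id,
              row_number() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
       FROM tablename_demo) t
WHERE t.rnum > 1;

-- The actual delete, using the same subquery.
DELETE FROM tablename_demo
WHERE id IN (SELECT id
             FROM (SELECT id,
                          row_number() OVER (PARTITION BY column1, column2 ORDER BY id) AS rnum
                   FROM tablename_demo) t
             WHERE t.rnum > 1);
```

Changing `ORDER BY id` to `ORDER BY id DESC` inside the window keeps the highest id per group instead.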

∝单色的世界 2024-08-18 07:51:36

Generalized query to delete duplicates:

DELETE FROM table_name
WHERE ctid NOT IN (
  SELECT max(ctid) FROM table_name
  GROUP BY column1, [column2, ...]
);

The column ctid is a special column available for every table but not visible unless specifically mentioned. The ctid column value is considered unique for every row in a table. See PostgreSQL system columns to learn more about ctid.
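
To see which groups are duplicated and where their rows physically sit, something like the following could be used. This is a sketch on an invented table (`table_name_demo`); the cast `ctid::text` is only there to make the values easy to aggregate and read.

```sql
-- Hypothetical demo table without any id column.
CREATE TEMP TABLE table_name_demo (column1 text, column2 text);
INSERT INTO table_name_demo VALUES ('a', 'x'), ('a', 'x'), ('b', 'y');

-- List each duplicated key with its copy count and the ctid of every copy.
SELECT column1,
       column2,
       count(*)              AS copies,
       array_agg(ctid::text) AS row_locations
FROM   table_name_demo
GROUP  BY column1, column2
HAVING count(*) > 1;
```

Running the same query after the cleanup is a cheap check: it should return no rows before the unique constraint is added.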

洒一地阳光 2024-08-18 07:51:36

From an old postgresql.org mailing list:

create table test ( a text, b text );

Unique values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

Duplicate values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

One more double duplicate

insert into test values ( 'x', 'y');

select oid, a, b from test;

Select duplicate rows

select o.oid, o.a, o.b from test o
    where exists ( select 'x'
                   from test i
                   where     i.a = o.a
                         and i.b = o.b
                         and i.oid < o.oid
                 );

Delete duplicate rows

Note: at the time of that post, PostgreSQL didn't support aliases on the table mentioned in the FROM clause of a DELETE, so the table name is used directly.

delete from test
    where exists ( select 'x'
                   from test i
                   where     i.a = test.a
                         and i.b = test.b
                         and i.oid < test.oid
             );
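
PostgreSQL 12 removed the option to create tables WITH OIDS, so on current servers an ordinary table has no oid column to compare. A sketch of the same EXISTS pattern with ctid swapped in for oid (the table name `test_ctid_demo` is made up; the rows mirror the example above):

```sql
-- Same data as the mailing-list example: 4 unique rows, 5 duplicates.
CREATE TEMP TABLE test_ctid_demo (a text, b text);
INSERT INTO test_ctid_demo VALUES
  ('x', 'y'), ('x', 'x'), ('y', 'y'), ('y', 'x'),
  ('x', 'y'), ('x', 'x'), ('y', 'y'), ('y', 'x'),
  ('x', 'y');

-- Delete each row that has an identical row at a lower physical position.
DELETE FROM test_ctid_demo
WHERE EXISTS (SELECT 1
              FROM   test_ctid_demo i
              WHERE  i.a = test_ctid_demo.a
              AND    i.b = test_ctid_demo.b
              AND    i.ctid < test_ctid_demo.ctid);

SELECT a, b FROM test_ctid_demo ORDER BY a, b;
```

ctid changes when a row is updated or the table is rewritten, so it is only suitable as a row identifier within a single statement like this one.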
楠木可依 2024-08-18 07:51:36

This function removes duplicates without removing indexes and does it to any table.

Usage: select remove_duplicates('mytable');

---
--- remove_duplicates(tablename) removes duplicate records from a table (convert from set to unique set)
---
CREATE OR REPLACE FUNCTION remove_duplicates(text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT * FROM ' || tablename || ');';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;
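
Because the function concatenates the table name straight into SQL, it breaks on names that need quoting and is open to injection. A hedged variant, not the original author's code: the function name `remove_duplicates_quoted`, the fixed temp table name `_distinct_rows`, and the demo table are all my own assumptions. It takes a `regclass` argument and builds the statements with `format()`.

```sql
CREATE OR REPLACE FUNCTION remove_duplicates_quoted(tbl regclass) RETURNS void AS $$
BEGIN
  -- %s with a regclass value expands to a quoted (and, if needed, schema-qualified) name.
  EXECUTE format('CREATE TEMPORARY TABLE _distinct_rows AS SELECT DISTINCT * FROM %s', tbl);
  EXECUTE format('DELETE FROM %s', tbl);
  EXECUTE format('INSERT INTO %s SELECT * FROM _distinct_rows', tbl);
  DROP TABLE _distinct_rows;
END;
$$ LANGUAGE plpgsql;

-- Usage on a hypothetical table whose name needs quoting.
CREATE TEMP TABLE "My Table" (a int, b text);
INSERT INTO "My Table" VALUES (1, 'x'), (1, 'x'), (2, 'y');
SELECT remove_duplicates_quoted('"My Table"');
```

Passing a name that does not refer to an existing table fails at the cast to `regclass`, before any statement is executed.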
假情假意假温柔 2024-08-18 07:51:36

I just used Erwin Brandstetter's answer successfully to remove duplicates in a join table (a table lacking its own primary IDs), but found that there's one important caveat.

Including ON COMMIT DROP means the temporary table will get dropped at the end of the transaction. For me, that meant the temporary table was no longer available by the time I went to insert it!

I just did CREATE TEMPORARY TABLE t_tmp AS SELECT DISTINCT * FROM tbl; and everything worked fine.

The temporary table does get dropped at the end of the session.
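
For what it's worth, ON COMMIT DROP does work when everything runs inside one explicit transaction; without BEGIN, each statement is its own transaction, so the temp table vanishes right after it is created. A sketch under that assumption, with a made-up join table `join_demo`:

```sql
-- Hypothetical join table without its own primary id.
CREATE TEMP TABLE join_demo (left_id int, right_id int);
INSERT INTO join_demo VALUES (1, 10), (1, 10), (2, 20);

BEGIN;
CREATE TEMP TABLE t_tmp_demo ON COMMIT DROP AS
SELECT DISTINCT * FROM join_demo;

TRUNCATE join_demo;

INSERT INTO join_demo
SELECT * FROM t_tmp_demo;
COMMIT;  -- t_tmp_demo is dropped here, after the rows are safely back
```

If the client runs in autocommit mode and the BEGIN is omitted, the INSERT fails because `t_tmp_demo` no longer exists, which matches the behavior described above.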

一身仙ぐ女味 2024-08-18 07:51:36
DELETE FROM table
  WHERE something NOT IN
    (SELECT     MAX(s.something)
      FROM      table As s
      GROUP BY  s.this_thing, s.that_thing);
落花浅忆 2024-08-18 07:51:36

If you have only one or a few duplicated entries, and they are indeed duplicated (that is, they appear twice), you can use the "hidden" ctid column, as proposed above, together with LIMIT:

DELETE FROM mytable WHERE ctid=(SELECT ctid FROM mytable WHERE […] LIMIT 1);

This will delete only the first of the selected rows.
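
A concrete instance of that statement, as a sketch: the table `mytable_demo` and the predicate `name = 'bob'` are invented to stand in for the elided condition.

```sql
-- Hypothetical table where 'bob' appears twice.
CREATE TEMP TABLE mytable_demo (name text);
INSERT INTO mytable_demo VALUES ('alice'), ('bob'), ('bob');

-- The subquery picks the ctid of one matching row; only that row is deleted.
DELETE FROM mytable_demo
WHERE ctid = (SELECT ctid FROM mytable_demo WHERE name = 'bob' LIMIT 1);

SELECT name FROM mytable_demo ORDER BY name;
```

For an entry that appears three times, the statement has to be run twice, so this only scales to a handful of duplicates.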

飘过的浮云 2024-08-18 07:51:36

First, you need to decide on which of your "duplicates" you will keep. If all columns are equal, OK, you can delete any of them... But perhaps you want to keep only the most recent, or some other criterion?

The fastest way depends on your answer to the question above, and also on the % of duplicates on the table. If you throw away 50% of your rows, you're better off doing CREATE TABLE ... AS SELECT DISTINCT ... FROM ... ;, and if you delete 1% of the rows, using DELETE is better.

Also for maintenance operations like this, it's generally good to set work_mem to a good chunk of your RAM: run EXPLAIN, check the number N of sorts/hashes, and set work_mem to your RAM / 2 / N. Use lots of RAM; it's good for speed. As long as you only have one concurrent connection...
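
One way to apply the work_mem advice without touching the server configuration is `SET LOCAL`, which reverts at the end of the transaction. A sketch with assumed values: the table `big_demo`, its generated data, and the 256MB figure are examples, not recommendations.

```sql
-- Hypothetical table: 10000 rows, only 1000 distinct ones.
CREATE TEMP TABLE big_demo (a int, b text);
INSERT INTO big_demo
SELECT g % 1000, 'row' FROM generate_series(1, 10000) AS g;

BEGIN;
-- Applies to this transaction only; pick the value from RAM / 2 / N as described above.
SET LOCAL work_mem = '256MB';

-- Check how many sort/hash nodes the plan contains before committing to a value.
EXPLAIN SELECT DISTINCT * FROM big_demo;

CREATE TEMP TABLE big_demo_dedup AS
SELECT DISTINCT * FROM big_demo;
COMMIT;
```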

黑白记忆 2024-08-18 07:51:36

I'm working with PostgreSQL 8.4. When I ran the proposed code, I found that it was not actually removing the duplicates. In running some tests, I found that adding "DISTINCT ON (duplicate_column_name)" and "ORDER BY duplicate_column_name" did the trick. I'm no SQL guru; I found this in the PostgreSQL 8.4 SELECT...DISTINCT doc.

CREATE OR REPLACE FUNCTION remove_duplicates(text, text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
  duplicate_column ALIAS FOR $2;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT ON (' || duplicate_column || ') * FROM ' || tablename || ' ORDER BY ' || duplicate_column || ' ASC);';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;
旧街凉风 2024-08-18 07:51:36

This works very nicely and is very quick:

CREATE INDEX otherTable_idx ON otherTable( colName );
CREATE TABLE newTable AS select DISTINCT ON (colName) col1,colName,col2 FROM otherTable;
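
Without an ORDER BY, which row DISTINCT ON keeps for each key is unpredictable. A sketch that makes the choice explicit, assuming invented tables (`other_table_demo`, `new_table_demo`) and assuming the row with the highest col1 is the one worth keeping:

```sql
-- Hypothetical source: two rows share col_name = 'a'.
CREATE TEMP TABLE other_table_demo (col1 int, col_name text, col2 text);
INSERT INTO other_table_demo VALUES
  (1, 'a', 'old'), (2, 'a', 'new'), (3, 'b', 'only');

-- The leading ORDER BY column must match DISTINCT ON; the next one picks the survivor.
CREATE TEMP TABLE new_table_demo AS
SELECT DISTINCT ON (col_name) col1, col_name, col2
FROM   other_table_demo
ORDER  BY col_name, col1 DESC;

SELECT * FROM new_table_demo ORDER BY col_name;
```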
失与倦" 2024-08-18 07:51:36
DELETE FROM tablename
WHERE id IN (SELECT id
    FROM (SELECT id,ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
                 FROM tablename) t
          WHERE t.rnum > 1);

Delete duplicates by column(s) and keep the row with the lowest id. The pattern is taken from the postgres wiki.

Using CTEs, you can write a more readable version of the above. Each CTE has to be defined before the one that references it:

WITH num_of_rows AS (
    SELECT id,
        ROW_NUMBER() OVER (PARTITION BY column1,
                                        column2,
                                        column3 ORDER BY id) AS rnum
    FROM tablename
),
duplicate_ids AS (
    SELECT id
    FROM num_of_rows
    WHERE rnum > 1
)
DELETE FROM tablename
WHERE id IN (SELECT id FROM duplicate_ids);
感情废物 2024-08-18 07:51:36
CREATE TABLE test (col text);
INSERT INTO test VALUES
 ('1'),
 ('2'), ('2'),
 ('3'),
 ('4'), ('4'),
 ('5'),
 ('6'), ('6');
DELETE FROM test
 WHERE ctid in (
   SELECT t.ctid FROM (
     SELECT row_number() over (
               partition BY col
               ORDER BY col
               ) AS rnum,
            ctid FROM test
       ORDER BY col
     ) t
    WHERE t.rnum >1);