Remove duplicate rows in MySQL

Posted on 2024-09-11 04:22:27

I have a table with the following fields:

id (Unique)
url (Unique)
title
company
site_id

Now, I need to remove rows that have the same title, company and site_id. One way to do this is to use the following SQL along with a script (PHP):

SELECT title, site_id, location, id, count( * ) 
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1

After running this query, I can remove duplicates using a server side script.

However, I would like to know whether this can be done using only an SQL query.
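
As a rough illustration of the detection step, here is a minimal sketch that runs the GROUP BY / HAVING query against a throwaway table. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, groups only on the three columns named in the question, and uses invented sample rows; the helper names `make_jobs` and `find_duplicate_groups` are hypothetical.

```python
import sqlite3


def make_jobs(rows):
    """Build an in-memory jobs table; the sample rows are invented for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
        "title TEXT, company TEXT, site_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO jobs (url, title, company, site_id) VALUES (?, ?, ?, ?)", rows
    )
    return conn


def find_duplicate_groups(conn):
    """Return (site_id, company, title, n) for every group holding more than one row."""
    return conn.execute(
        """
        SELECT site_id, company, title, COUNT(*) AS n
        FROM jobs
        GROUP BY site_id, company, title
        HAVING COUNT(*) > 1
        ORDER BY site_id, company, title
        """
    ).fetchall()


conn = make_jobs([
    ("u1", "Engineer", "Acme", 1),
    ("u2", "Engineer", "Acme", 1),  # same title/company/site_id as u1
    ("u3", "Analyst", "Acme", 1),
    ("u4", "Engineer", "Acme", 2),  # different site_id, so not a duplicate
])
duplicate_groups = find_duplicate_groups(conn)
print(duplicate_groups)
```

This only reports which groups contain duplicates; it does not say which row of each group to keep, which is the part the answers below deal with.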

Comments (28)

疯了 2024-09-18 04:22:27

A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:

ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);

This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this...

Edit: no longer works in MySQL 5.7+

The IGNORE clause of ALTER TABLE was deprecated in MySQL 5.6 and removed in MySQL 5.7, so this statement no longer works on those versions.
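
The "future duplicate INSERTs will error out" part still holds for a plain unique index once the table is clean. A small sketch, assuming Python's sqlite3 module as a stand-in for MySQL (so the error surfaces as `sqlite3.IntegrityError` rather than a MySQL duplicate-key error); the helper `try_insert` and the sample rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, company TEXT, site_id INTEGER)"
)
conn.execute("INSERT INTO jobs (title, company, site_id) VALUES ('Engineer', 'Acme', 1)")

# The table holds no duplicates yet, so the unique index can be created directly.
conn.execute("CREATE UNIQUE INDEX idx_name ON jobs (site_id, title, company)")


def try_insert(conn, title, company, site_id):
    """Return True if the row was inserted, False if the unique index rejected it."""
    try:
        conn.execute(
            "INSERT INTO jobs (title, company, site_id) VALUES (?, ?, ?)",
            (title, company, site_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert(conn, "Engineer", "Acme", 1)  # same triple as the first row
new_row_accepted = try_insert(conn, "Analyst", "Acme", 1)     # a new triple
row_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(duplicate_accepted, new_row_accepted, row_count)
```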

噩梦成真你也成魔 2024-09-18 04:22:27

If you don't want to alter the column properties, then you can use the query below.

Since you have a column which has unique IDs (e.g., auto_increment columns), you can use it to remove the duplicates:

DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND (`a`.`title` = `b`.`title` OR `a`.`title` IS NULL AND `b`.`title` IS NULL)
    AND (`a`.`company` = `b`.`company` OR `a`.`company` IS NULL AND `b`.`company` IS NULL)
    AND (`a`.`site_id` = `b`.`site_id` OR `a`.`site_id` IS NULL AND `b`.`site_id` IS NULL);

In MySQL, you can simplify it even more with the NULL-safe equal operator (aka "spaceship operator"):

DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND `a`.`title` <=> `b`.`title`
    AND `a`.`company` <=> `b`.`company`
    AND `a`.`site_id` <=> `b`.`site_id`;
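
To see why the NULL-safe comparison matters, here is a small sketch that treats rows whose company is NULL as duplicates of each other. It assumes Python's sqlite3 module as a stand-in: SQLite has no multi-table DELETE, so a correlated EXISTS replaces the self-join, and SQLite's `IS` operator plays the role of MySQL's `<=>`. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, company TEXT, site_id INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs (title, company, site_id) VALUES (?, ?, ?)",
    [
        ("Engineer", None, 1),    # id 1: duplicate of id 2, company is NULL
        ("Engineer", None, 1),    # id 2: kept (highest id in its group)
        ("Engineer", "Acme", 1),  # id 3: different company, kept
    ],
)

# Delete every row for which a NULL-safe-equal row with a higher id exists.
conn.execute(
    """
    DELETE FROM jobs
    WHERE EXISTS (
        SELECT 1 FROM jobs AS b
        WHERE jobs.id < b.id
          AND jobs.title IS b.title
          AND jobs.company IS b.company
          AND jobs.site_id IS b.site_id
    )
    """
)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM jobs ORDER BY id")]
print(remaining_ids)
```

With a plain `=` in place of the NULL-safe comparison, the two rows with a NULL company would never match each other and both would survive.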
清泪尽 2024-09-18 04:22:27

MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:

create temporary table tmpTable (id int);

insert  into tmpTable
        (id)
select  id
from    YourTable yt
where   exists
        (
        select  *
        from    YourTable yt2
        where   yt2.title = yt.title
                and yt2.company = yt.company
                and yt2.site_id = yt.site_id
                and yt2.id > yt.id
        );

delete  
from    YourTable
where   ID in (select id from tmpTable);

From Kostanos' suggestion in the comments:
The only slow query above is DELETE, for cases where you have a very large database. This query could be faster:

DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
游魂 2024-09-18 04:22:27

Deleting duplicates in MySQL tables is a common issue, generally the result of a missing constraint that would have prevented those duplicates beforehand. But this common issue usually comes with specific needs that require specific approaches. The approach should differ depending on, for example, the size of the data, which duplicated entry should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.

MySQL also has some peculiarities of its own, such as not being able to reference the same table in a FROM clause when performing an UPDATE on it (this raises MySQL error #1093). This limitation can be overcome by using an inner query with a temporary table (as suggested in some approaches above). But this inner query won't perform especially well when dealing with big data sources.

However, a better approach does exist to remove duplicates, that's both efficient and reliable, and that can be easily adapted to different needs.

The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from your former table into the new one, while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and skips the need of using an inner query to search for duplicates and a temporary table that should be kept in memory (thus fitting big data sources too).

This is how it can be achieved. Given we have a table employee, with the following columns:

employee (id, first_name, last_name, start_date, ssn)

In order to delete the rows with a duplicate ssn column, and keeping only the first entry found, the following process can be followed:

-- create a new tmp_employee table
CREATE TABLE tmp_employee LIKE employee;

-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);

-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

Technical explanation

  • Line #1 creates a new tmp_employee table with exactly the same structure as the employee table
  • Line #2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates
  • Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_employee table, while ignoring duplicated entries
  • Line #4 renames tables, so that the new employee table holds all the entries without the duplicates, and a backup copy of the former data is kept on the backup_employee table

Using this approach, 1.6M records were reduced to 6k in less than 200 seconds.
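
The four steps can be tried end to end on a toy table. This is only a sketch under a few assumptions: it uses Python's sqlite3 module instead of MySQL, so `INSERT OR IGNORE` stands in for `INSERT IGNORE`, the copy's schema is spelled out because SQLite has no `CREATE TABLE ... LIKE`, and the rename takes two `ALTER TABLE` statements. The sample employees are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        start_date TEXT, ssn TEXT
    );
    INSERT INTO employee (first_name, last_name, start_date, ssn) VALUES
        ('Ann', 'Lee', '2020-01-01', '111'),
        ('Ann', 'Lee', '2021-01-01', '111'),
        ('Bob', 'Ray', '2019-05-05', '222');

    -- Steps 1 and 2: an empty copy of the table with a UNIQUE constraint on ssn.
    CREATE TABLE tmp_employee (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        start_date TEXT, ssn TEXT UNIQUE
    );

    -- Step 3: copy in id order; rows that would violate the constraint are skipped.
    INSERT OR IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

    -- Step 4: swap the tables, keeping the original as a backup.
    ALTER TABLE employee RENAME TO backup_employee;
    ALTER TABLE tmp_employee RENAME TO employee;
    """
)
kept = conn.execute("SELECT id, ssn FROM employee ORDER BY id").fetchall()
backup_count = conn.execute("SELECT COUNT(*) FROM backup_employee").fetchone()[0]
print(kept, backup_count)
```

The first row for each ssn wins because the copy runs in ascending id order; the untouched original stays available as backup_employee.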

Chetan, following this process, you could quickly and easily remove all your duplicates and create a UNIQUE constraint by running:

CREATE TABLE tmp_jobs LIKE jobs;

ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);

INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;

RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;

Of course, this process can be further modified to adapt it for different needs when deleting duplicates. Some examples follow.

✔ Variation for keeping the last entry instead of the first one

Sometimes we need to keep the last duplicated entry instead of the first one.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, the ORDER BY id DESC clause gives the highest IDs priority over the rest

✔ Variation for performing some tasks on the duplicates, for example keeping a count on the duplicates found

Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates = n_duplicates + 1;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, a new column n_duplicates is created
  • On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter)
    The INSERT INTO ... ON DUPLICATE KEY UPDATE query can be used to perform different types of updates for the duplicates found.
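
The counting variation can be sketched the same way. The assumptions here: Python's sqlite3 module stands in for MySQL, so SQLite's `ON CONFLICT(ssn) DO UPDATE` plays the role of `ON DUPLICATE KEY UPDATE` (it needs SQLite 3.24 or newer), and the `WHERE true` is there only because SQLite's parser requires a WHERE clause when an upsert follows a SELECT. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        start_date TEXT, ssn TEXT
    );
    INSERT INTO employee (first_name, last_name, start_date, ssn) VALUES
        ('Ann', 'Lee', '2020-01-01', '111'),
        ('Ann', 'Lee', '2021-01-01', '111'),
        ('Ann', 'Lee', '2022-01-01', '111'),
        ('Bob', 'Ray', '2019-05-05', '222');

    -- Copy of the table with a UNIQUE constraint plus the extra counter column.
    CREATE TABLE tmp_employee (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
        start_date TEXT, ssn TEXT UNIQUE, n_duplicates INTEGER DEFAULT 0
    );

    -- The column list is explicit because the copy has one more column than the source.
    INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn)
    SELECT id, first_name, last_name, start_date, ssn
    FROM employee
    WHERE true
    ORDER BY id
    ON CONFLICT(ssn) DO UPDATE SET n_duplicates = n_duplicates + 1;
    """
)
counts = conn.execute(
    "SELECT id, ssn, n_duplicates FROM tmp_employee ORDER BY id"
).fetchall()
print(counts)
```

Each skipped duplicate bumps the counter on the row that was kept, so the counter holds the number of extra copies rather than the total number of rows per ssn.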

✔ Variation for regenerating the auto-incremental field id

Sometimes we use an auto-incremental field and, in order to keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn) SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, instead of selecting all the fields on the table, the id field is skipped so that the DB engine generates a new one automatically

✔ Further variations

Many further modifications are also doable depending on the desired behavior. As an example, the following queries will use a second temporary table to, besides 1) keep the last entry instead of the first one; and 2) increase a counter on the duplicates found; also 3) regenerate the auto-incremental field id while keeping the entry order as it was on the former data.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

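INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates = n_duplicates + 1;

CREATE TABLE tmp_employee2 LIKE tmp_employee;

INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates) SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;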

DROP TABLE tmp_employee;

RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;
清风疏影 2024-09-18 04:22:27

If the IGNORE approach does not work for you, as in my case, you can use the statements below:

CREATE TABLE your_table_deduped LIKE your_table;


INSERT your_table_deduped
SELECT *
FROM your_table
GROUP BY index1_id,
         index2_id;

RENAME TABLE your_table TO your_table_with_dupes;

RENAME TABLE your_table_deduped TO your_table;

#OPTIONAL
ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);

#OPTIONAL
DROP TABLE your_table_with_dupes;
默嘫て 2024-09-18 04:22:27

There is another solution:

DELETE t1 FROM my_table t1, my_table t2 WHERE t1.id < t2.id AND t1.my_field = t2.my_field AND t1.my_field_2 = t2.my_field_2 AND ...
毁梦 2024-09-18 04:22:27

A solution that is simple to understand and works with no primary key:

  1. add a new boolean column

    alter table mytable add tokeep boolean;
    
  2. add a constraint on the duplicated columns AND the new column

    alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);
    
  3. set the boolean column to true. This will succeed only on one of the duplicated rows because of the new constraint

    update ignore mytable set tokeep = true;
    
  4. delete rows that have not been marked as tokeep

    delete from mytable where tokeep is null;
    
  5. drop the added column

    alter table mytable drop tokeep;
    

I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.
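
Here is a small sketch of steps 1 to 4 on a table with no primary key. It assumes Python's sqlite3 module as a stand-in for MySQL: `UPDATE OR IGNORE` plays the role of `UPDATE IGNORE`, a unique index replaces the named constraint, and the final column drop is left out because SQLite will not drop a column that is part of an index. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mytable (mycol1 TEXT, mycol2 TEXT);
    INSERT INTO mytable VALUES ('a', 'x'), ('a', 'x'), ('a', 'x'), ('b', 'y');

    -- Step 1: new flag column, NULL for every existing row.
    ALTER TABLE mytable ADD COLUMN tokeep INTEGER;

    -- Step 2: uniqueness over the duplicated columns plus the flag.
    -- NULL flags never collide, so the existing duplicates do not block this.
    CREATE UNIQUE INDEX preventdupe ON mytable (mycol1, mycol2, tokeep);

    -- Step 3: only one row per (mycol1, mycol2) group can take the value 1;
    -- the other rows in the group are skipped and keep their NULL flag.
    UPDATE OR IGNORE mytable SET tokeep = 1;

    -- Step 4: drop the rows that could not be flagged.
    DELETE FROM mytable WHERE tokeep IS NULL;
    """
)
remaining = conn.execute(
    "SELECT mycol1, mycol2 FROM mytable ORDER BY mycol1"
).fetchall()
print(remaining)
```

The trick relies on NULLs being treated as distinct inside a unique index, which is why the flag has to start out as NULL rather than false.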

烏雲後面有陽光 2024-09-18 04:22:27

This will delete the duplicate rows with same values for title, company and site. The last occurrence will be kept and the remaining duplicates will be deleted (if you want to keep the first occurrence and delete the others, change the comparison on id to be greater than e.g. t1.id > t2.id)

DELETE t1 FROM tablename t1
INNER JOIN tablename t2 
WHERE 
    t1.id < t2.id AND
    t1.title = t2.title AND
    t1.company=t2.company AND
    t1.site_ID=t2.site_ID;
悲歌长辞 2024-09-18 04:22:27

If you have a large table with a huge number of records, the solutions above will either not work or take too much time. In that case there is a different solution:

-- Create temporary table

CREATE TABLE temp_table LIKE table1;

-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(title, company,site_id);

-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM table1;

-- Rename and drop
RENAME TABLE table1 TO old_table1, temp_table TO table1;
DROP TABLE old_table1;
你曾走过我的故事 2024-09-18 04:22:27

I have this query snippet for SQL Server, but I think it can be used in other DBMSs with few changes:

DELETE
FROM Table
WHERE Table.idTable IN  (  
    SELECT MAX(idTable)
    FROM Table
    GROUP BY field1, field2, field3
    HAVING COUNT(*) > 1)

I forgot to tell you that this query doesn't remove the row with the lowest id of the duplicated rows. If this works for you try this query:

DELETE
FROM jobs
WHERE jobs.id IN  (  
    SELECT MAX(id)
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING COUNT(*) > 1)
晚雾 2024-09-18 04:22:27

Simple and fast for all cases:

CREATE TEMPORARY TABLE IF NOT EXISTS _temp_duplicates AS (SELECT dub.id FROM table_with_duplications dub GROUP BY dub.field_must_be_uniq_1, dub.field_must_be_uniq_2 HAVING COUNT(*)  > 1);

DELETE FROM table_with_duplications WHERE id IN (SELECT id FROM _temp_duplicates);
止于盛夏 2024-09-18 04:22:27

I found a simple way. (keep latest)

DELETE t1 FROM table_name t1 INNER JOIN table_name t2 
WHERE t1.primary_id < t2.primary_id 
AND t1.check_duplicate_col_1 = t2.check_duplicate_col_1 
AND t1.check_duplicate_col_2 = t2.check_duplicate_col_2
...
庆幸我还是我 2024-09-18 04:22:27

The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.

CREATE TABLE tempTableName LIKE tableName;  
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);  
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;  
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName; 
DROP TABLE tempTableName;  
夏末染殇 2024-09-18 04:22:27

Delete duplicate rows using DELETE JOIN statement
MySQL provides you with the DELETE JOIN statement that you can use to remove duplicate rows quickly.

The following statement deletes duplicate rows and keeps the highest id:

DELETE t1 FROM contacts t1
    INNER JOIN
contacts t2 WHERE
t1.id < t2.id AND t1.email = t2.email;
挽手叙旧 2024-09-18 04:22:27

As of version 8.0 (2018), MySQL finally supports window functions.

Window functions are both handy and efficient. Here is a solution that demonstrates how to use them to solve this assignment.

In a subquery, we can use ROW_NUMBER() to assign a position to each record in the table within column1/column2 groups, ordered by id. If there are no duplicates, the record will get row number 1. If duplicates exist, they will be numbered by ascending id (starting at 1).

Once records are properly numbered in the subquery, the outer query just deletes all records whose row number is not 1.

Query:

DELETE FROM tablename
WHERE id IN (
    SELECT id
    FROM (
        SELECT 
            id, 
            ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id) rn
        FROM tablename
    ) t
    WHERE rn > 1
)
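
A quick way to check how the numbering behaves is to run the same shape of query on a toy table. This sketch assumes Python's sqlite3 module as a stand-in for MySQL 8.0 (window functions need SQLite 3.25 or newer); the table and column names follow the answer's placeholders and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT)"
)
conn.executemany(
    "INSERT INTO tablename (column1, column2) VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x")],
)

# Number the rows inside each (column1, column2) group by ascending id,
# then delete everything that is not the first row of its group.
conn.execute(
    """
    DELETE FROM tablename
    WHERE id IN (
        SELECT id
        FROM (
            SELECT
                id,
                ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
            FROM tablename
        ) AS t
        WHERE rn > 1
    )
    """
)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM tablename ORDER BY id")]
print(remaining_ids)
```

Switching the window's ORDER BY to `id DESC` would keep the newest row of each group instead of the oldest.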
油饼 2024-09-18 04:22:27

I keep visiting this page whenever I google "remove duplicates from mysql", but the IGNORE solutions don't work for me because I have InnoDB tables.

This code works better every time:

CREATE TABLE tableToclean_temp LIKE tableToclean;
ALTER TABLE tableToclean_temp ADD UNIQUE INDEX (fontsinuse_id);
INSERT IGNORE INTO tableToclean_temp SELECT * FROM tableToclean;
DROP TABLE tableToclean;
RENAME TABLE tableToclean_temp TO tableToclean;

tableToclean = the name of the table you need to clean

tableToclean_temp = a temporary table created and deleted

々眼睛长脚气 2024-09-18 04:22:27

This solution will move the duplicates into one table and the uniques into another.

-- speed up creating uniques table if dealing with many rows
CREATE INDEX temp_idx ON jobs(site_id, company, title, location);

-- create the table with unique rows
INSERT jobs_uniques SELECT * FROM
    (
    SELECT * 
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING count(1) > 1
    UNION
    SELECT *
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING count(1) = 1
) x

-- create the table with duplicate rows
INSERT jobs_dupes 
SELECT * 
FROM jobs
WHERE id NOT IN
(SELECT id FROM jobs_uniques)

-- confirm the difference between uniques and dupes tables
SELECT COUNT(1)
AS jobs, 
(SELECT COUNT(1) FROM jobs_dupes) + (SELECT COUNT(1) FROM jobs_uniques)
AS sum
FROM jobs
素食主义者 2024-09-18 04:22:27

Delete duplicate rows with the DELETE JOIN statement:

DELETE t1 FROM table_name t1
JOIN table_name t2
WHERE
    t1.id < t2.id AND
    t1.title = t2.title AND t1.company = t2.company AND t1.site_id = t2.site_id;
长亭外,古道边 2024-09-18 04:22:27

To delete the duplicate records in a table:

delete from job s 
where rowid < any 
(select rowid from job k 
where s.site_id = k.site_id and 
s.title = k.title and 
s.company = k.company);

or

delete from job s 
where rowid not in 
(select max(rowid) from job k 
where s.site_id = k.site_id and
s.title = k.title and 
s.company = k.company);
南七夏 2024-09-18 04:22:27

Here is what I used, and it works:

create table temp_table like my_table;

t_id is my unique column

insert into temp_table (id) select id from my_table GROUP by t_id;
delete from my_table where id not in (select id from temp_table);
drop table temp_table;
梦里梦着梦中梦 2024-09-18 04:22:27

To remove duplicate records so that a combination of columns, e.g. COL1, COL2, COL3, is not repeated (suppose the three-column unique constraint was missed in the table structure and multiple duplicate entries have been inserted into the table):

DROP TABLE TABLE_NAME_copy;
CREATE TABLE TABLE_NAME_copy LIKE TABLE_NAME;
INSERT INTO TABLE_NAME_copy
SELECT * FROM TABLE_NAME
GROUP BY COLUMN1, COLUMN2, COLUMN3; 
DROP TABLE TABLE_NAME;
ALTER TABLE TABLE_NAME_copy RENAME TO TABLE_NAME;

Hope this helps developers.

む无字情书 2024-09-18 04:22:27

This is perfect if you are trying to delete one of the duplicates and leave the other. Note that without subqueries you would get a #1093 error.

DELETE FROM table_name
WHERE id IN (
    SELECT * FROM (SELECT n.id FROM table_name n
    WHERE n.column2 != "value"
    GROUP BY n.column HAVING COUNT(n.column ) > 1) x
)
凉城 2024-09-18 04:22:27
CREATE TEMPORARY TABLE temp_table AS
SELECT max(option_id) AS max_option_id, title
FROM list_options
GROUP BY title;

DELETE FROM list_options
WHERE option_id NOT IN (
  SELECT max_option_id
  FROM temp_table
);

DROP TEMPORARY TABLE temp_table;
離殇 2024-09-18 04:22:27

I have a table where I forgot to add a primary key on the id column, although id has auto_increment. One day, someone replayed the MySQL binary log on the database, which inserted some duplicate rows.

I remove the duplicate row by

  1. select the unique duplicate rows and export them

    select T1.* from table_name T1
    inner join (select count(*) as c, id
        from table_name
        group by id) T2
    on T1.id = T2.id
    where T2.c > 1
    group by T1.id;
    
  2. delete the duplicate rows by id

  3. insert the row from the exported data.

  4. Then add the primary key on id

十年九夏 2024-09-18 04:22:27

This is the query that I am using, and it works like a gem.

Here is the query:

delete from products where products_id in ( select * from ( select products_id from products group by products_red having count(*) >1)_tmp)

This deletes without any temporary tables.

Benefit: if there are 10000 records with 2 to 5 copies each, you need to run it 5 times to clear all the duplicates.
Drawback: if each record has 10000 duplicates, you need to run it 10000 times, because each run clears only one row per group.

Use this after considering the product count and the duplicate count.

夏尔 2024-09-18 04:22:27

I like to be a bit more specific as to which records I delete so here is my solution:

delete
from jobs c1
where not c1.location = 'Paris'
and  c1.site_id > 64218
and exists 
(  
select * from jobs c2 
where c2.site_id = c1.site_id
and   c2.company = c1.company
and   c2.location = c1.location
and   c2.title = c1.title
and   c2.site_id > 63412
and   c2.site_id < 64219
)
聆听风音 2024-09-18 04:22:27

You can easily delete the duplicate records with this code:

$qry = mysql_query("SELECT * from cities");
while($qry_row = mysql_fetch_array($qry))
{
$qry2 = mysql_query("SELECT * from cities2 where city = '".$qry_row['city']."'");

if(mysql_num_rows($qry2) > 1){
    while($row = mysql_fetch_array($qry2)){
        $city_arry[] = $row;

        }

    $total = sizeof($city_arry) - 1;
        for($i=1; $i<=$total; $i++){


            mysql_query( "delete from cities2 where town_id = '".$city_arry[$i][0]."'");

            }
    }
    //exit;
}
安稳善良 2024-09-18 04:22:27

I had to do this with text fields and came across the limit of 100 bytes on the index.

I solved this by adding a column, storing an MD5 hash of the fields in it, and then doing the alter.

ALTER TABLE `table` ADD `merged` VARCHAR( 40 ) NOT NULL ;
UPDATE `table` SET `merged` = MD5(CONCAT(`col1`, `col2`, `col3`));
ALTER IGNORE TABLE `table` ADD UNIQUE INDEX idx_name (`merged`);