SQL：如何根据两个字段查找重复项？

发布于 09-14 09:15 字数 709 浏览 14 评论 0原文

我在 Oracle 数据库表中有一些行，这些行对于两个字段的组合应该是唯一的，但表上没有设置唯一约束，因此我需要使用 SQL 自己查找违反约束的所有行。不幸的是，我微薄的 SQL 技能无法胜任这项任务。

我的表包含三列相关的列：entity_id、station_id 和 obs_year。对于每一行，station_id 和 obs_year 的组合应该是唯一的，我想通过使用 SQL 查询将它们清除来找出是否存在违反此规定的行。

我尝试过以下 SQL（由上一个问题建议），但它没有对我不起作用（我的 ORA-00918 列定义不明确）：

SELECT
entity_id, station_id, obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

有人可以建议我做错了什么，和/或如何解决这个问题吗？

原文

I have rows in an Oracle database table which should be unique for a combination of two fields but the unique constrain is not set up on the table so I need to find all rows which violate the constraint myself using SQL. Unfortunately my meager SQL skills aren't up to the task.

My table has three columns which are relevant: entity_id, station_id, and obs_year. For each row the combination of station_id and obs_year should be unique, and I want to find out if there are rows which violate this by flushing them out with an SQL query.

I have tried the following SQL (suggested by this previous question) but it doesn't work for me (I get ORA-00918 column ambiguously defined):

SELECT
entity_id, station_id, obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

Can someone suggest what I'm doing wrong, and/or how to solve this?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

诺曦2024-09-21 09:15:18

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

回复收藏 0 原文

江南月2024-09-21 09:15:18

SELECT entity_id, station_id, obs_year
FROM mytable t1
WHERE EXISTS (SELECT 1 from mytable t2 Where
       t1.station_id = t2.station_id
       AND t1.obs_year = t2.obs_year
       AND t1.RowId <> t2.RowId)

SELECT entity_id, station_id, obs_year
FROM mytable t1
WHERE EXISTS (SELECT 1 from mytable t2 Where
       t1.station_id = t2.station_id
       AND t1.obs_year = t2.obs_year
       AND t1.RowId <> t2.RowId)

回复收藏 0 原文

爺獨霸怡葒院2024-09-21 09:15:18

将初始选择中的 3 个字段更改为

SELECT
t1.entity_id, t1.station_id, t1.obs_year

Change the 3 fields in the initial select to be

SELECT
t1.entity_id, t1.station_id, t1.obs_year

回复收藏 0 原文

笑红尘2024-09-21 09:15:18

重写您的查询

SELECT
t1.entity_id, t1.station_id, t1.obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

我认为不明确的列错误（ORA-00918）是因为您选择的列的名称同时出现在表和子查询中，但您没有指定是否需要它来自 dupes 或来自 mytable（别名为 t1）。

Re-write of your query

SELECT
t1.entity_id, t1.station_id, t1.obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

I think the ambiguous column error (ORA-00918) was because you were selecting columns whose names appeared in both the table and the subquery, but you did not specifiy if you wanted it from dupes or from mytable (aliased as t1).

回复收藏 0 原文

长亭外，古道边2024-09-21 09:15:18

我认为这里的很多解决方案都很麻烦且难以理解，因为我有 3 列主键约束并且需要查找重复项。所以这是一个选择

SELECT id, name, value, COUNT(*) FROM db_name.table_name
GROUP BY id, name, value
HAVING COUNT(*) > 1

I thought a lot of the solutions here were cumbersome and tough to understand since I had a 3 column primary key constraint and needed to find the duplicates. So here's an option

SELECT id, name, value, COUNT(*) FROM db_name.table_name
GROUP BY id, name, value
HAVING COUNT(*) > 1

回复收藏 0 原文

家住魔仙堡2024-09-21 09:15:18

您是否不能创建一个包含唯一约束的新表，然后逐行复制数据，忽略失败？

回复收藏 0 原文

欢烬2024-09-21 09:15:18

您需要在主选择中指定列的表。另外，假设entity_id是mytable的唯一键并且与查找重复项无关，您不应该在dupes子查询中对其进行分组。

尝试：

SELECT t1.entity_id, t1.station_id, t1.obs_year
FROM mytable t1
INNER JOIN (
SELECT station_id, obs_year FROM mytable 
GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

You need to specify the table for the columns in the main select. Also, assuming entity_id is the unique key for mytable and is irrelevant to finding duplicates, you should not be grouping on it in the dupes subquery.

Try:

SELECT t1.entity_id, t1.station_id, t1.obs_year
FROM mytable t1
INNER JOIN (
SELECT station_id, obs_year FROM mytable 
GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

回复收藏 0 原文

月下客2024-09-21 09:15:18

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

by Quassnoi 对于大型桌子来说是最有效的。
我对成本进行了这样的分析：

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
WHERE EXISTS (SELECT 1 from trn_refil_book b Where
       a.dist_code = b.dist_code and a.book_date = b.book_date and a.book_no = b.book_no
       AND a.RowId <> b.RowId)
       ;

给出的成本为 1322341

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
INNER JOIN (
SELECT b.dist_code, b.book_date, b.book_no FROM trn_refil_book b 
GROUP BY b.dist_code, b.book_date, b.book_no HAVING COUNT(*) > 1) c 
ON 
 a.dist_code = c.dist_code and a.book_date = c.book_date and a.book_no = c.book_no
;

给出的成本为 1271699

而

SELECT  dist_code, book_date, book_no
FROM    (
        SELECT  t.dist_code, t.book_date, t.book_no, ROW_NUMBER() OVER (PARTITION BY t.book_date, t.book_no
          ORDER BY t.dist_code) AS rn
        FROM    trn_refil_book t
        ) p
WHERE   p.rn > 1
;

给出的成本为 1021984

该表未建立索引......

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

by Quassnoi is the most efficient for large tables.
I had this analysis of cost :

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
WHERE EXISTS (SELECT 1 from trn_refil_book b Where
       a.dist_code = b.dist_code and a.book_date = b.book_date and a.book_no = b.book_no
       AND a.RowId <> b.RowId)
       ;

gave a cost of 1322341

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
INNER JOIN (
SELECT b.dist_code, b.book_date, b.book_no FROM trn_refil_book b 
GROUP BY b.dist_code, b.book_date, b.book_no HAVING COUNT(*) > 1) c 
ON 
 a.dist_code = c.dist_code and a.book_date = c.book_date and a.book_no = c.book_no
;

gave a cost of 1271699

while

SELECT  dist_code, book_date, book_no
FROM    (
        SELECT  t.dist_code, t.book_date, t.book_no, ROW_NUMBER() OVER (PARTITION BY t.book_date, t.book_no
          ORDER BY t.dist_code) AS rn
        FROM    trn_refil_book t
        ) p
WHERE   p.rn > 1
;

gave a cost of 1021984

The table was not indexed....

回复收藏 0 原文

信愁2024-09-21 09:15:18

  SELECT entity_id, station_id, obs_year
    FROM mytable
GROUP BY entity_id, station_id, obs_year
HAVING COUNT(*) > 1

指定字段以在 SELECT 和 GROUP BY 上查找重复项。

它的工作原理是使用GROUP BY根据指定的列查找与任何其他行匹配的任何行。
HAVING COUNT(*) > 1 表示我们只对查看出现次数超过 1 次的任何行感兴趣（因此是重复的）

  SELECT entity_id, station_id, obs_year
    FROM mytable
GROUP BY entity_id, station_id, obs_year
HAVING COUNT(*) > 1

Specify the fields to find duplicates on both the SELECT and the GROUP BY.

It works by using GROUP BY to find any rows that match any other rows based on the specified Columns.
The HAVING COUNT(*) > 1 says that we are only interested in seeing any rows that occur more than 1 time (and are therefore duplicates)

回复收藏 0 原文

一人独醉2024-09-21 09:15:18

我很惊讶这里没有任何使用 CTE（通用表表达式）的答案，

WITH cte as (
  SELECT 
     ROW_NUMBER() 
     OVER(
     PARTITION BY Last_Name, First_Name order by BIRTHDATE)
        AS RN,
     Employee_number, First_Name, Last_Name, BirthDate,
     SUM(1)
     OVER(
     PARTITION BY Last_Name, First_Name
     ORDER BY BIRTHDATE ROWS BETWEEN UNBOUNDED PRECEDING 
                             AND UNBOUNDED FOLLOWING)
         AS CNT
    FROM
        employment)
   select * from cte where cnt > 1

这不仅会找到重复项（仅在名字和姓氏上），还会告诉您有多少个重复项。

I'm surprised there aren't any answers here that use a CTE (Common Table Expression)

WITH cte as (
  SELECT 
     ROW_NUMBER() 
     OVER(
     PARTITION BY Last_Name, First_Name order by BIRTHDATE)
        AS RN,
     Employee_number, First_Name, Last_Name, BirthDate,
     SUM(1)
     OVER(
     PARTITION BY Last_Name, First_Name
     ORDER BY BIRTHDATE ROWS BETWEEN UNBOUNDED PRECEDING 
                             AND UNBOUNDED FOLLOWING)
         AS CNT
    FROM
        employment)
   select * from cte where cnt > 1

Not only will this find duplicates (on first and last name only), it will tell you how many there are.

回复收藏 0 原文

~没有更多了~