Oracle: identifying duplicates in a table with no index
When I try to create a unique index on a large table, I get a unique constraint error. The unique index in this case is a composite key of 4 columns.
Is there an efficient way to identify the duplicates other than:
select col1, col2, col3, col4, count(*)
from Table1
group by col1, col2, col3, col4
having count(*) > 1
The explain plan for the query above shows a full table scan with an extremely high cost; I just want to find out if there is another way.
Thanks!
Comments (5)
Try creating a non-unique index on these four columns first. That will take O(n log n) time, but will also reduce the time needed to perform the `select` to O(n log n). You're in a bit of a bind here -- any way you slice it, the entire table has to be read in at least once. The naïve algorithm runs in O(n²) time, unless the query optimizer is clever enough to build a temporary index/table.
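A sketch of that approach; the index name is illustrative, while the table and column names come from the question:

```sql
-- Non-unique index covering all four key columns (name is hypothetical)
CREATE INDEX ix_table1_dupcheck ON Table1 (col1, col2, col3, col4);

-- The duplicate-finding query can now typically be answered from the
-- index alone, without visiting the table rows
SELECT col1, col2, col3, col4, COUNT(*)
FROM   Table1
GROUP  BY col1, col2, col3, col4
HAVING COUNT(*) > 1;
```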
You can use the EXCEPTIONS INTO clause to trap the duplicated rows.
If you don't already have an EXCEPTIONS table, create one using the script Oracle provides:
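A sketch of what that script does (Oracle ships it as `utlexcpt.sql` under `$ORACLE_HOME/rdbms/admin`; the table it creates looks roughly like this):

```sql
-- Equivalent of Oracle's utlexcpt.sql: a holding table for constraint violations
CREATE TABLE exceptions (
    row_id     ROWID,
    owner      VARCHAR2(30),
    table_name VARCHAR2(30),
    constraint VARCHAR2(30)
);
```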
Now you can attempt to create a unique constraint like this:
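For example (the constraint name is hypothetical; the table and columns are from the question, and `exceptions` is assumed to be the EXCEPTIONS table created above):

```sql
ALTER TABLE Table1
  ADD CONSTRAINT uq_table1_4col UNIQUE (col1, col2, col3, col4)
  EXCEPTIONS INTO exceptions;
```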
This will fail, but now your EXCEPTIONS table contains a list of all the rows whose keys contain duplicates, identified by ROWID. That gives you a basis for deciding what to do with the duplicates (delete, renumber, whatever).
Edit
As others have noted you have to pay the cost of scanning the table once. This approach gives you a permanent set of the duplicated rows, and ROWID is the fastest way of accessing any given row.
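For instance, a sketch of pulling back the offending rows (assuming the EXCEPTIONS table above is named `exceptions`):

```sql
-- Fetch every row whose key collided, by direct ROWID lookup
SELECT t.*
FROM   Table1 t
WHERE  t.rowid IN (SELECT row_id FROM exceptions);
```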
Since there is no index on those columns, that query would have to do a full table scan - no other way to do it really, unless one or more of those columns is already indexed.
You could create the index as a non-unique index, then run the query to identify the duplicate rows (which should be very fast once the index is created). But I doubt that the combined time of creating the non-unique index and then running the query would be any less than just running the query without the index.
In fact, you need to look for a duplicate of every single row in the table. There is no way to do this efficiently without an index.
I don't think there is a quicker way, unfortunately.