How to efficiently find duplicate blob rows in MySQL?

Posted on 2024-07-09 07:03:22


I have a table of the form

CREATE TABLE data
(
   pk INT PRIMARY KEY AUTO_INCREMENT,
   dt BLOB
);

It has about 160,000 rows and about 2 GB of data in the blob column (roughly 14 KB per blob on average). Another table has foreign keys into this table.

Something like 3,000 of the blobs are duplicates. So what I want is a query that will give me a remap table that will allow me to remove the duplicates.

The naive approach took about an hour on 30-40k rows:

SELECT a.pk, MIN(b.pk)
  FROM data AS a
  JOIN data AS b
    ON a.dt = b.dt
  WHERE b.pk < a.pk
  GROUP BY a.pk;
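
To make the goal concrete, here is a rough sketch of how the result of a query like the one above would be used. The names remap, refs and data_fk are placeholders I made up for this example (refs stands in for the other table that references data); only data, pk and dt are real.

```sql
-- Hypothetical remap table: duplicate pk -> pk of the row that survives.
CREATE TABLE remap
(
   old_pk INT PRIMARY KEY,
   new_pk INT NOT NULL
);

-- Fill it from the duplicate-finding query.
INSERT INTO remap (old_pk, new_pk)
SELECT a.pk, MIN(b.pk)
  FROM data AS a
  JOIN data AS b
    ON a.dt = b.dt
  WHERE b.pk < a.pk
  GROUP BY a.pk;

-- Repoint the referencing table first (refs/data_fk are assumed names) ...
UPDATE refs AS r
  JOIN remap AS m ON r.data_fk = m.old_pk
   SET r.data_fk = m.new_pk;

-- ... then drop the now-unreferenced duplicate rows.
DELETE d
  FROM data AS d
  JOIN remap AS m ON d.pk = m.old_pk;
```

Because new_pk is always the smallest pk in a group of identical blobs, it never appears as an old_pk itself, so a single pass over the referencing table should be enough.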

I happen to have, for other reasons, a table that has the sizes of the blobs:

CREATE TABLE sizes
(
   fk INT,  -- note: non-unique
   sz INT
   -- other cols
);

With one index on fk and another on sz, the direct query using that table takes about 24 sec with 50k rows:

SELECT da.pk, MIN(db.pk)
  FROM data AS da
  JOIN data AS db
  JOIN sizes AS sa
  JOIN sizes AS sb
  ON
        sa.sz = sb.sz
    AND da.pk = sa.fk
    AND db.pk = sb.fk
  WHERE
        sb.fk < sa.fk
    AND da.dt = db.dt
  GROUP BY da.pk;

However, that does a full table scan on da (the data table). Given that the hit rate should be fairly low, I'd think an index scan would be better. With that in mind, I added a third copy of data as a fifth join to get that, and lost about 3 sec.
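
For reference, this is roughly the setup behind the observation above. The index names are made up for this sketch; the EXPLAIN output is where the full scan on da shows up (type ALL in the row for da).

```sql
-- The two indexes on the sizes table (names are placeholders).
CREATE INDEX ix_sizes_fk ON sizes (fk);
CREATE INDEX ix_sizes_sz ON sizes (sz);

-- Ask MySQL for the plan instead of running the query.
-- sz is the size column from the sizes definition.
EXPLAIN
SELECT da.pk, MIN(db.pk)
  FROM data AS da
  JOIN data AS db
  JOIN sizes AS sa
  JOIN sizes AS sb
  ON
        sa.sz = sb.sz
    AND da.pk = sa.fk
    AND db.pk = sb.fk
  WHERE
        sb.fk < sa.fk
    AND da.dt = db.dt
  GROUP BY da.pk;
```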

OK, so for the question: am I going to do much better than the second SELECT? If so, how?

A bit of a corollary: if I have a table where the key columns get very heavy use but the rest should only rarely be used, will I ever be better off adding another join of that table to encourage an index scan instead of a full table scan?

Xgc on #[email protected] points out that adding a utility table like sizes, but with a unique constraint on fk, might help a lot. Some fun with triggers and whatnot might make it not too bad to keep up to date.
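
A minimal sketch of what I understand that suggestion to mean, untested at scale. The table name blob_sizes and the index and trigger names are my own placeholders; only data, pk and dt come from the schema above.

```sql
-- Hypothetical utility table: exactly one row per blob, so fk can be UNIQUE.
CREATE TABLE blob_sizes
(
   fk INT NOT NULL,
   sz INT,                 -- NULL when dt is NULL
   UNIQUE KEY uq_blob_sizes_fk (fk),
   KEY ix_blob_sizes_sz (sz)
);

-- One-off backfill; LENGTH() on a BLOB returns its size in bytes.
INSERT INTO blob_sizes (fk, sz)
SELECT pk, LENGTH(dt) FROM data;

-- Triggers to keep the utility table in step with data.
-- AFTER INSERT is used so that NEW.pk already holds the AUTO_INCREMENT value.
CREATE TRIGGER data_after_insert AFTER INSERT ON data
FOR EACH ROW
   INSERT INTO blob_sizes (fk, sz) VALUES (NEW.pk, LENGTH(NEW.dt));

CREATE TRIGGER data_after_update AFTER UPDATE ON data
FOR EACH ROW
   UPDATE blob_sizes
      SET fk = NEW.pk, sz = LENGTH(NEW.dt)
    WHERE fk = OLD.pk;

CREATE TRIGGER data_after_delete AFTER DELETE ON data
FOR EACH ROW
   DELETE FROM blob_sizes WHERE fk = OLD.pk;
```

Each trigger body is a single statement, so no BEGIN ... END block or DELIMITER change should be needed. Whether the unique constraint actually changes the plan for the duplicate query is exactly what I would want to check with EXPLAIN.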
