根据位掩码查找mysql中的重复记录

发布于 2024-09-12 11:52:13 字数 1113 浏览 0 评论 0原文

我有一个 mysql 表,用于存储传感器的维护日志。我想设计一个查询,查找出于相同原因修复/维护给定传感器的实例。 (重复出现的问题查找器。)

我的表(简化)如下所示:

id    name     mask
==    ====     ====
11    alpha    0011
12    alpha    0010
13    alpha    0100
14    beta     0001

掩码字段是一个位掩码,其中每个位置代表特定类型的修复。我能够成功地弄清楚如何比较位掩码(每个这个问题),但尝试将其合并到查询中被证明比我想象的更困难。

对于上述示例记录,只有 id 11 和 12 适用,因为它们在第三个掩码位置都有 1

这是我尝试过的方法以及为什么它不起作用:


1. Never finishes...
This query seems to run forever, I don't think it is working the way I want.

SELECT t1.id, t1.name
FROM data t1
LEFT OUTER JOIN data t2
ON (CONV(t1.mask,2,10) & CONV(t2.mask,2,10) > 0)
GROUP BY t1.name
HAVING COUNT(*) >1;

2.查询不完整...
我正在考虑创建一个视图,只解析表中实际有多个条目的传感器。我不知道接下来该去哪里。

SELECT COUNT(t1.name) AS times, t1.name, t1.id, t1.mask
FROM data AS t1
GROUP BY t1.name ASC
HAVING times > 1;

对此有什么建议吗?

I have a mysql table which stores maintenance logs for sensors. I'd like to design a query that finds instances where a given sensor was repaired/maintained for the same reason. (Recurring problem finder.)

My table (simplified) looks like this:

id    name     mask
==    ====     ====
11    alpha    0011
12    alpha    0010
13    alpha    0100
14    beta     0001

The mask field is a bitmask where each position represents a particular type of repair. I was able to successfully figure out how to compare the bitmask (per this question) but trying to incorporate it into a query is proving more difficult than I thought.

Given the above sample records, only id's 11 and 12 apply, since they both have a 1 in the third mask position.

Here's what I've tried and why it didn't work:


1. Never finishes...
This query seems to run forever, I don't think it is working the way I want.

SELECT t1.id, t1.name
FROM data t1
LEFT OUTER JOIN data t2
ON (CONV(t1.mask,2,10) & CONV(t2.mask,2,10) > 0)
GROUP BY t1.name
HAVING COUNT(*) >1;

2. Incomplete query...
I was thinking of creating a view, to only parse the sensors that actually have more than one entry in the table. I wasn't sure where to go from here.

SELECT COUNT(t1.name) AS times, t1.name, t1.id, t1.mask
FROM data AS t1
GROUP BY t1.name ASC
HAVING times > 1;

Any suggestions on this?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

ぽ尐不点ル 2024-09-19 11:52:13

由于数据库结构的设计没有考虑到 RDBM 的实际情况(可能不是你的做法,我只是必须说明这一点......),尽管可以编写一个能够完成的查询,但性能总是很差。

Jim 是正确的,查询结果是笛卡尔积。如果该查询以未分组和未过滤的方式返回,您可能会得到 (SELECT POW(COUNT(*), 2) FROM data) 结果。此外,任何形式的外连接都是不必要的,因此标准的内连接就是您想要的(并不是说它应该在性能方面产生影响,只是更合适)。

另外,连接的另一个条件 t1.id != t2.id 也是必要的,以免每个记录与其自身匹配。

SELECT t1.id, t1.name
FROM data t1
JOIN data t2
     ON t1.name = t2.name
     AND t1.id != t2.id     //
WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0
GROUP BY t1.name
HAVING COUNT(*) > 1;

您的不完整查询:

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE t1.name IN (SELECT t2.name FROM data t2 GROUP BY t2.name HAVING COUNT(*) > 1);

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE EXISTS (SELECT 1 FROM data t2 WHERE t2.name = t1.name GROUP BY t2.name HAVING COUNT(*) > 1);

我无法告诉您其中哪一个表现最好。如果 data.name 已建立索引(我希望如此),则任一查询的成本都应该相当低。前者将缓存子选择的副本,而后者将针对索引执行多个查询。

一种非常基本的优化(同时保持整个表结构不变)是将掩码字段转换为无符号整数数据类型,从而节省对 CONV() 的多次调用。

WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0

当然,进一步分解

WHERE t1.mask & t2.mask > 0

数据确实更有意义。 现在,在

id    name     mask
==    ====     ====
11    alpha    1101

成为

id    name    value
==    ====    =====
11    alpha       1
12    alpha       4
13    alpha       8

namevalue 上战略性放置的索引使查询 一个蛋糕

SELECT name, value
FROM data
GROUP BY name, value
HAVING COUNT(*) > 1;

我希望这有帮助。

Since the database structure was not designed with the realities of RDBMs in mind (probably not your doing, I just have to make the point anyway…), the performance will always be poor, though it is possible to write a query that will finish.

Jim is correct in that the query results in a cartesian product. If that query were to be returned ungrouped and unfiltered, you could expect (SELECT POW(COUNT(*), 2) FROM data) results. Also, any form of outer join is unnecessary, so a standard inner join is what you want here (not that it ought to make a difference in terms of performance, it's just more appropriate).

Also another condition of the join, t1.id != t2.id is necessary, lest each record match itself.

SELECT t1.id, t1.name
FROM data t1
JOIN data t2
     ON t1.name = t2.name
     AND t1.id != t2.id     //
WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0
GROUP BY t1.name
HAVING COUNT(*) > 1;

Your incomplete query:

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE t1.name IN (SELECT t2.name FROM data t2 GROUP BY t2.name HAVING COUNT(*) > 1);

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE EXISTS (SELECT 1 FROM data t2 WHERE t2.name = t1.name GROUP BY t2.name HAVING COUNT(*) > 1);

Off the top of my head I can't tell you which of those would perform best. If data.name is indexed (and I would hope it is), the cost for either query ought to be rather low. The former will cache a copy of the subselect, whereas the latter will perform multiple queries against the index.

One very basic optimization (while leaving the table structure as a whole untouched) would be to convert the mask field to an unsigned integer data type, thereby saving many calls to CONV().

WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0

becomes

WHERE t1.mask & t2.mask > 0

Of course, breaking the data down further does make more sense. Instead of storing a bitmask in one record, break out all the ones bits into separate records

id    name     mask
==    ====     ====
11    alpha    1101

would become

id    name    value
==    ====    =====
11    alpha       1
12    alpha       4
13    alpha       8

Now, a strategically placed index on name and value makes the query a piece of cake

SELECT name, value
FROM data
GROUP BY name, value
HAVING COUNT(*) > 1;

I hope that this helps.

南渊 2024-09-19 11:52:13

将掩码位分解为实际列。 RDMB 不喜欢位字段。

Break the mask bits out in real columns. RDMBs don't like bit fields.

掐死时间 2024-09-19 11:52:13

您的连接会产生表与其自身的笛卡尔积。将“t1.name=t2.name”添加到连接中,给出一堆(小得多)的笛卡尔积,每个唯一的名称一个,这将大大加快速度。

Your join results in a cartesian product of the table with itself. Add `t1.name=t2.name' to the join, giving a bunch of (much) smaller cartesian products, one per unique name, which will speed things up considerably.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文