Select the "first" row (as determined by ORDER BY) from near-duplicate rows (as determined by GROUP BY, HAVING, COUNT) in SQLite
I have a problem which is a bit beyond me (I'm really awfully glad I'm a Beta) involving duplicates (so GROUP BY, HAVING, COUNT), compounded by keeping the solution within the standard functions that came with SQLite. I am using the sqlite3 module from Python.
Example table Workers, columns:
* ID: integer, auto-incrementing
* ColA: integer
* ColB: varchar(20)
* UserType: varchar(20)
* LoadMe: Boolean
(Yes, SQLite's datatypes are nominal)
My data table, Workers, at start looks like:
ID ColA ColB UserType LoadMe
1 1 a Alpha 0
2 1 b Beta 0
3 2 a Alpha 0
4 2 a Beta 0
5 2 b Delta 0
6 2 b Alpha 0
7 1 a Delta 0
8 1 b Epsilon 0
9 1 c Gamma 0
10 4 b Delta 0
11 5 a Alpha 0
12 5 a Beta 0
13 5 b Gamma 0
14 5 a Alpha 0
I would like to enable, for Loading onto trucks at a new factory, all workers who have unique combinations of ColA and ColB. For those duplicates (twins, triplets, etc., perhaps via Bokanovsky's Process) where a combination of ColA and ColB has more than one worker, I would like to select only one from each set of duplicates. To make the problem harder, I would additionally like to be able to select that one from each set of duplicates on the basis of UserType in some form of ORDER BY. I may wish to select the first "duplicate" with a UserType of "Alpha," to work on a frightfully clever problem, or ORDER BY UserType DESC, that I may issue an order for black tunics for the lowest of the workers.
You can see that IDs 9, 10, and 13 have unique combinations of ColA and ColB and are most easily identified. The 1-a, 1-b, 2-a, 2-b, and 5-a combinations, however, have duplicates within them.
My current process, as it stands so far:
0) Everyone comes with a unique ID number. This is done at birth.
1) SET all Workers to LoadMe = 1.
    UPDATE Workers
    SET LoadMe = 1
2) Find my duplicates based on their similarity in two columns (GROUP BY ColA, ColB):
    SELECT Wk1.*
    FROM Workers AS Wk1
    INNER JOIN (
        SELECT ColA, ColB
        FROM Workers
        GROUP BY ColA, ColB
        HAVING COUNT(*) > 1
    ) AS Wk2
    ON Wk1.ColA = Wk2.ColA
    AND Wk1.ColB = Wk2.ColB
    ORDER BY ColA, ColB
3) SET all of my duplicates to LoadMe = 0.
    UPDATE Workers
    SET LoadMe = 0
    WHERE ID IN (
        SELECT Wk1.ID
        FROM Workers AS Wk1
        INNER JOIN (
            SELECT ColA, ColB
            FROM Workers
            GROUP BY ColA, ColB
            HAVING COUNT(*) > 1
        ) AS Wk2
        ON Wk1.ColA = Wk2.ColA
        AND Wk1.ColB = Wk2.ColB
    )
4) For each set of duplicates in my GROUP BY, ORDERed BY UserType, SELECT only one, the first in the list, to have LoadMe SET to 1.
This table would look like:
ID ColA ColB UserType LoadMe
1 1 a Alpha 1
2 1 b Beta 1
3 2 a Alpha 1
4 2 a Beta 0
5 2 b Delta 0
6 2 b Alpha 1
7 1 a Delta 0
8 1 b Epsilon 0
9 1 c Gamma 1
10 4 b Delta 1
11 5 a Alpha 1
12 5 a Beta 0
13 5 b Gamma 1
14 5 a Alpha 0
ORDERed BY ColA, ColB, UserType, then ID, and broken out by the GROUP BY columns (and finally spaced for clarity), that same data might look like:
ID ColA ColB UserType LoadMe
1 1 a Alpha 1
7 1 a Delta 0

2 1 b Beta 1
8 1 b Epsilon 0

9 1 c Gamma 1

3 2 a Alpha 1
4 2 a Beta 0

6 2 b Alpha 1
5 2 b Delta 0

10 4 b Delta 1

11 5 a Alpha 1
14 5 a Alpha 0
12 5 a Beta 0

13 5 b Gamma 1
I am confounded on the last step and feel like an Epsilon-minus semi-moron. I had previously been pulling the duplicates out of the database into program space and working within Python, but this situation arises not infrequently and I would like to more permanently solve this.
I like to break a problem like this up a bit. The first step is to identify the unique ColA, ColB pairs:
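A minimal sketch of this first step, assuming the Workers table and sample data from the question and run through Python's sqlite3 module. The in-memory database, the ROWS list and the pairs variable are only for illustration, and SELECT DISTINCT is one plausible way to list the pairs (GROUP BY ColA, ColB would do the same job):

```python
import sqlite3

# Sample data from the question: (ID, ColA, ColB, UserType).
ROWS = [
    (1, 1, "a", "Alpha"), (2, 1, "b", "Beta"), (3, 2, "a", "Alpha"),
    (4, 2, "a", "Beta"), (5, 2, "b", "Delta"), (6, 2, "b", "Alpha"),
    (7, 1, "a", "Delta"), (8, 1, "b", "Epsilon"), (9, 1, "c", "Gamma"),
    (10, 4, "b", "Delta"), (11, 5, "a", "Alpha"), (12, 5, "a", "Beta"),
    (13, 5, "b", "Gamma"), (14, 5, "a", "Alpha"),
]

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.execute(
    "CREATE TABLE Workers (ID INTEGER PRIMARY KEY AUTOINCREMENT, ColA INTEGER,"
    " ColB VARCHAR(20), UserType VARCHAR(20), LoadMe BOOLEAN DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO Workers (ID, ColA, ColB, UserType) VALUES (?, ?, ?, ?)", ROWS
)
conn.commit()

# One row per distinct (ColA, ColB) combination.
pairs = conn.execute(
    "SELECT DISTINCT ColA, ColB FROM Workers ORDER BY ColA, ColB"
).fetchall()
for pair in pairs:
    print(pair)
```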
Now for each of these pairs you want to find the highest priority record. A join won't work because you'll end up with multiple records for each unique pair but a subquery will work:
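A sketch of what such a subquery could look like, again assuming the question's Workers table and sample data. The aliases w1 and w2 are illustrative, and breaking ties on ID is an assumption taken from the question's "ColA, ColB, UserType, then ID" ordering:

```python
import sqlite3

# Sample data from the question: (ID, ColA, ColB, UserType).
ROWS = [
    (1, 1, "a", "Alpha"), (2, 1, "b", "Beta"), (3, 2, "a", "Alpha"),
    (4, 2, "a", "Beta"), (5, 2, "b", "Delta"), (6, 2, "b", "Alpha"),
    (7, 1, "a", "Delta"), (8, 1, "b", "Epsilon"), (9, 1, "c", "Gamma"),
    (10, 4, "b", "Delta"), (11, 5, "a", "Alpha"), (12, 5, "a", "Beta"),
    (13, 5, "b", "Gamma"), (14, 5, "a", "Alpha"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Workers (ID INTEGER PRIMARY KEY AUTOINCREMENT, ColA INTEGER,"
    " ColB VARCHAR(20), UserType VARCHAR(20), LoadMe BOOLEAN DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO Workers (ID, ColA, ColB, UserType) VALUES (?, ?, ?, ?)", ROWS
)
conn.commit()

# For each distinct (ColA, ColB) pair, a correlated subquery picks the ID of
# the row that sorts first by UserType (ties broken by ID, an assumption).
FIRST_PER_PAIR = """
    SELECT DISTINCT w1.ColA, w1.ColB,
           (SELECT w2.ID
              FROM Workers AS w2
             WHERE w2.ColA = w1.ColA AND w2.ColB = w1.ColB
             ORDER BY w2.UserType, w2.ID
             LIMIT 1) AS id
      FROM Workers AS w1
     ORDER BY w1.ColA, w1.ColB
"""
chosen = conn.execute(FIRST_PER_PAIR).fetchall()
for row in chosen:
    print(row)
```

Changing the inner ORDER BY to w2.UserType DESC would pick the other end of each group instead.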
You can change the ORDER BY clause in the subquery to control the priority. LIMIT 1 ensures that there is only one record for each subquery (SQLite documents that a scalar subquery evaluates to the first row of its result, so it would likely pick the same record without it, but the explicit LIMIT states the intent).

The result of this query is a list of records to be loaded with ColA, ColB, id. I would probably work directly from that and get rid of LoadMe, but if you want to keep it you could do this:

That clears the LoadMe flag and then sets it to 1 for each of the records returned by our last query. The transaction guarantees that this all takes place or fails as one step and never leaves your LoadMe fields in an inconsistent state.
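The clear-then-set update described above might be sketched as follows from Python's sqlite3 module. This assumes the question's table and sample data; the variable names and the ID tie-break are illustrative, and the connection's context manager is used here to get the single-transaction behaviour (it commits on success and rolls back if an exception is raised):

```python
import sqlite3

# Sample data from the question: (ID, ColA, ColB, UserType).
ROWS = [
    (1, 1, "a", "Alpha"), (2, 1, "b", "Beta"), (3, 2, "a", "Alpha"),
    (4, 2, "a", "Beta"), (5, 2, "b", "Delta"), (6, 2, "b", "Alpha"),
    (7, 1, "a", "Delta"), (8, 1, "b", "Epsilon"), (9, 1, "c", "Gamma"),
    (10, 4, "b", "Delta"), (11, 5, "a", "Alpha"), (12, 5, "a", "Beta"),
    (13, 5, "b", "Gamma"), (14, 5, "a", "Alpha"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Workers (ID INTEGER PRIMARY KEY AUTOINCREMENT, ColA INTEGER,"
    " ColB VARCHAR(20), UserType VARCHAR(20), LoadMe BOOLEAN DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO Workers (ID, ColA, ColB, UserType) VALUES (?, ?, ?, ?)", ROWS
)
conn.commit()

# Set LoadMe = 1 only for the first row of each (ColA, ColB) group,
# ordered by UserType and then ID (the ID tie-break is an assumption).
SET_FIRST = """
    UPDATE Workers
       SET LoadMe = 1
     WHERE ID IN (
           SELECT (SELECT w2.ID
                     FROM Workers AS w2
                    WHERE w2.ColA = w1.ColA AND w2.ColB = w1.ColB
                    ORDER BY w2.UserType, w2.ID
                    LIMIT 1)
             FROM Workers AS w1)
"""

with conn:  # both statements commit together, or roll back together
    conn.execute("UPDATE Workers SET LoadMe = 0")
    conn.execute(SET_FIRST)

loaded = [row[0] for row in conn.execute(
    "SELECT ID FROM Workers WHERE LoadMe = 1 ORDER BY ID")]
print(loaded)
```

On the sample data this flags IDs 1, 2, 3, 6, 9, 10, 11 and 13, which matches the table the question asks for.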