Counting DISTINCT over multiple columns
Is there a better way of doing a query like this:
SELECT COUNT(*)
FROM (SELECT DISTINCT DocumentId, DocumentSessionId
FROM DocumentOutputItems) AS internalQuery
I need to count the number of distinct items from this table but the distinct is over two columns.
My query works fine but I was wondering if I can get the final result using just one query (without using a sub-query)
If you are trying to improve performance, you could try creating a persisted computed column on either a hash or concatenated value of the two columns.
Once it is persisted, provided the column is deterministic and you are using "sane" database settings, it can be indexed and / or statistics can be created on it.
I believe a distinct count of the computed column would be equivalent to your query.
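A minimal sketch of that idea in T-SQL, assuming SQL Server and that both columns are int NOT NULL. The column name DocumentPairKey and the index name are made up for illustration, and the concatenated key uses a ':' separator so that (1, 23) and (12, 3) stay distinct:

```sql
-- Hypothetical names: DocumentPairKey, IX_DocumentOutputItems_DocumentPairKey.
-- CAST from int to varchar is deterministic, so the column can be persisted and indexed.
ALTER TABLE DocumentOutputItems
    ADD DocumentPairKey AS
        (CAST(DocumentId AS varchar(11)) + ':' + CAST(DocumentSessionId AS varchar(11)))
        PERSISTED;
GO

CREATE INDEX IX_DocumentOutputItems_DocumentPairKey
    ON DocumentOutputItems (DocumentPairKey);
GO

-- Distinct count over the single computed column instead of the two source columns.
SELECT COUNT(DISTINCT DocumentPairKey)
FROM DocumentOutputItems;
```

GO is the batch separator used by SSMS and sqlcmd; the new column has to exist before a later batch can reference it. If either column can be NULL, the key becomes NULL and COUNT(DISTINCT ...) skips that row, so the result would no longer match the sub-query version.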
Edit: Altered from the less-than-reliable checksum-only query
I've discovered a way to do this (in SQL Server 2005) that works pretty well for me and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable
To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.
In MySQL you can do the same thing without the concatenation step as follows:
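For illustration, the two forms might look like this in MySQL against the table from the question. This is a sketch, not the answerer's original code, and the ':' separator in the first query is my own addition:

```sql
-- Concatenation form: one string per row, then a distinct count of those strings.
SELECT COUNT(DISTINCT CONCAT(DocumentId, ':', DocumentSessionId))
FROM DocumentOutputItems;

-- MySQL-only form: COUNT(DISTINCT ...) accepts several expressions directly.
SELECT COUNT(DISTINCT DocumentId, DocumentSessionId)
FROM DocumentOutputItems;
```

Both forms skip rows where either column is NULL, which is where they can differ from the SELECT DISTINCT sub-query in the question.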
This feature is mentioned in the MySQL documentation:
http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct
What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique permutations, why not try it? It certainly works as you might expect in Oracle.
edit
I went down a blind alley with analytics but the answer was depressingly obvious...
edit 2
Given the following data the concatenating solution provided above will miscount:
So we need to include a separator...
Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.
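To make the miscount concrete, here is a small self-contained Oracle sketch. The sample rows and the names sample_pairs, col1 and col2 are invented for the example, and '|' is assumed never to occur in the data:

```sql
-- Two different pairs that concatenate to the same string 'ABC' without a separator.
WITH sample_pairs AS (
    SELECT 'A'  AS col1, 'BC' AS col2 FROM dual
    UNION ALL
    SELECT 'AB' AS col1, 'C'  AS col2 FROM dual
)
SELECT COUNT(DISTINCT col1 || col2)        AS without_separator,  -- 1 (miscount)
       COUNT(DISTINCT col1 || '|' || col2) AS with_separator      -- 2 (correct)
FROM sample_pairs;
```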
Some SQL databases can work with a tuple expression so you can just do:
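As a sketch of the tuple form, this is how it can be written in PostgreSQL, which accepts a row value inside COUNT(DISTINCT ...). Other databases may reject it:

```sql
-- The parenthesised pair is treated as a single row value.
SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId))
FROM DocumentOutputItems;
```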
If your database doesn't support this, it can be simulated as per @oncel-umut-turer's suggestion of CHECKSUM or another scalar function providing good uniqueness, e.g. COUNT(DISTINCT CONCAT(DocumentId, ':', DocumentSessionId)).
MySQL specifically supports COUNT(DISTINCT expr, expr, ...), which is non-SQL standard syntax. It also notes: "In standard SQL, you would have to do a concatenation of all expressions inside COUNT(DISTINCT ...)."
A related use of tuples is performing IN queries such as:
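A tuple IN query might look like the following. The literal pairs are placeholders, and the syntax is accepted by PostgreSQL, MySQL and Oracle but not by SQL Server:

```sql
-- Match rows whose (DocumentId, DocumentSessionId) pair is one of the listed pairs.
SELECT *
FROM DocumentOutputItems
WHERE (DocumentId, DocumentSessionId) IN ((1, 10), (2, 20));
```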
How about something like:
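One way to write a DISTINCT-free version is to group on both columns and count the groups. This is a sketch of that idea, with group_size as an illustrative alias:

```sql
-- The inner query returns one row per (DocumentId, DocumentSessionId) combination;
-- the outer query counts those rows.
SELECT COUNT(*)
FROM (SELECT COUNT(*) AS group_size
      FROM DocumentOutputItems
      GROUP BY DocumentId, DocumentSessionId) AS internalQuery;
```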
It probably just does the same as what you are already doing, but it avoids the DISTINCT.
Here's a shorter version without the subselect:
It works fine in MySQL, and I think that the optimizer has an easier time understanding this one.
Edit: Apparently I misread MSSQL as MySQL - sorry about that, but maybe it helps anyway.
I have used this approach and it has worked for me.
In my case, it gives the correct result.
If you're working with datatypes of fixed length, you can cast to binary to do this very easily and very quickly. Assuming DocumentId and DocumentSessionId are both ints, and are therefore 4 bytes long...
My specific problem required me to divide a SUM by the COUNT of the distinct combination of various foreign keys and a date field, grouping by another foreign key and occasionally filtering by certain values or keys. The table is very large, and using a sub-query dramatically increased the query time. And due to the complexity, statistics simply weren't a viable option. The CHECKSUM solution was also far too slow in its conversion, particularly as a result of the various data types, and I couldn't risk its unreliability.
However, using the above solution caused virtually no increase in the query time (compared with using simply the SUM), and should be completely reliable! It should be able to help others in a similar situation so I'm posting it here.
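A sketch of the binary cast described in the answer above, in T-SQL, assuming both columns are int NOT NULL so that each cast is exactly 4 bytes:

```sql
-- Two fixed-width 4-byte values concatenate into an unambiguous 8-byte key,
-- so no separator is needed and different pairs cannot collide.
SELECT COUNT(DISTINCT CAST(DocumentId AS binary(4)) + CAST(DocumentSessionId AS binary(4)))
FROM DocumentOutputItems;
```

The fixed width is what makes this safe: unlike string concatenation, the boundary between the two values is always at byte 4.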
There's nothing wrong with your query, but you could also do it this way:
How about this,
This will get us the count of all possible combinations of DocumentId and DocumentSessionId.
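One reading of this, as a sketch, is a GROUP BY that returns one row per combination together with how often it occurs. The alias occurrences is illustrative:

```sql
-- One output row per distinct (DocumentId, DocumentSessionId) pair in the table.
SELECT DocumentId, DocumentSessionId, COUNT(*) AS occurrences
FROM DocumentOutputItems
GROUP BY DocumentId, DocumentSessionId;
```

The number of rows this returns equals the distinct count the question asks for, but getting that number as a single value still needs an outer COUNT(*).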
If you had only one field to "DISTINCT", you could use:
and that does return the same query plan as the original, as tested with SET SHOWPLAN_ALL ON. However you are using two fields so you could try something crazy like:
but you'll have issues if NULLs are involved. I'd just stick with the original query.
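The NULL issue can be seen with a small self-contained T-SQL sketch. The sample rows and the names sample_pairs, a and b are invented, and the default CONCAT_NULL_YIELDS_NULL ON setting is assumed:

```sql
-- Three distinct pairs, two of which contain a NULL.
WITH sample_pairs (a, b) AS (
    SELECT * FROM (VALUES (1, 10), (1, NULL), (NULL, 10)) AS v (a, b)
)
SELECT
    -- SELECT DISTINCT keeps the pairs containing NULL: 3 rows.
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT a, b FROM sample_pairs) AS d) AS subquery_count,
    -- String + NULL is NULL, and COUNT(DISTINCT ...) ignores NULLs: 1 row.
    (SELECT COUNT(DISTINCT CAST(a AS varchar(11)) + '|' + CAST(b AS varchar(11)))
     FROM sample_pairs) AS concat_count;
```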
I wish MS SQL could also do something like COUNT(DISTINCT A, B). But it can't.
At first JayTee's answer seemed like a solution to me, but after some tests CHECKSUM() failed to create unique values. A quick example: both CHECKSUM(31,467,519) and CHECKSUM(69,1120,823) give the same answer, which is 55.
Then I did some research and found that Microsoft does NOT recommend using CHECKSUM for change detection purposes. In some forums some suggested using
but this is also not comforting.
You can use HASHBYTES() function as suggested in TSQL CHECKSUM conundrum. However this also has a small chance of not returning unique results.
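For reference, a HASHBYTES-based count could be sketched like this in T-SQL. It assumes SQL Server 2012 or later for the SHA2_256 algorithm and that both columns are int NOT NULL; the ':' separator is my own choice:

```sql
-- Hash a separator-joined string of the two columns and count the distinct hashes.
SELECT COUNT(DISTINCT HASHBYTES('SHA2_256',
           CAST(DocumentId AS varchar(11)) + ':' + CAST(DocumentSessionId AS varchar(11))))
FROM DocumentOutputItems;
```

Since the hash is computed from the concatenated string, it cannot be more reliable than counting that string directly; it only shortens the value being compared.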
I would suggest using
Hope this works; I am writing this prima vista.
I found this when I Googled my own issue, and found that if you count DISTINCT objects, you get the correct number returned (I'm using MySQL).
It works for me. In Oracle:
In JPQL:
I had a similar question, but the query I had was a sub-query with the comparison data in the main query. Something like:
Ignoring the complexities of this, I realized I couldn't get the value of a.code into the subquery with the double sub-query described in the original question.
So eventually I figured out I could cheat, and combine the columns:
This is what ended up working.
This query helps me (in MSSQL) find the unique count with more than one column.
This code uses DISTINCT on two columns and returns the number of rows for each of those distinct value combinations. It worked like a charm for me in MySQL.
You can just use the COUNT function twice.
In this case, it would be: