How do I (or can I) SELECT DISTINCT on multiple columns?

Posted on 2024-07-05 03:57:37

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.

So I'm thinking:

UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
             FROM sales
             HAVING count = 1)

But my brain hurts going any farther than that.

Comments (5)

咿呀咿呀哟 2024-07-12 03:57:37
SELECT DISTINCT a,b,c FROM t

is roughly equivalent to:

SELECT a,b,c FROM t GROUP BY a,b,c

It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
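As a quick check of that equivalence, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in DBMS (an assumption; the thread does not mention SQLite). The table t and its rows are made-up sample data.

```python
import sqlite3

# In-memory SQLite database; table t and its rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (1, 2, 3), (2, 2, 2)],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT a, b, c FROM t ORDER BY a, b, c"
).fetchall()
grouped_rows = conn.execute(
    "SELECT a, b, c FROM t GROUP BY a, b, c ORDER BY a, b, c"
).fetchall()
# GROUP BY can also aggregate per group, which plain DISTINCT cannot.
grouped_counts = conn.execute(
    "SELECT a, b, c, COUNT(*) FROM t GROUP BY a, b, c ORDER BY a, b, c"
).fetchall()

print(distinct_rows == grouped_rows)
print(grouped_counts)
```

The per-group COUNT(*) is what makes the HAVING filter in the UPDATE possible.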

For your query, I'd do it like this:

UPDATE sales
SET status = 'ACTIVE'
WHERE id IN
(
    SELECT S.id
    FROM sales S
    INNER JOIN
    (
        -- (saleprice, saledate) pairs that occur exactly once
        SELECT saleprice, saledate
        FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(*) = 1
    ) T
    ON S.saleprice = T.saleprice AND S.saledate = T.saledate
)
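To see the join-based UPDATE at work, the following sketch runs it against a small in-memory SQLite table from Python. SQLite, the column types, and the four sample rows are assumptions made for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Rows 1 and 2 share price and date; rows 3 and 4 are unique combinations.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, 'NEW')",
    [(1, 10, "2024-01-01"), (2, 10, "2024-01-01"),
     (3, 10, "2024-01-02"), (4, 20, "2024-01-01")],
)

conn.execute("""
    UPDATE sales
    SET status = 'ACTIVE'
    WHERE id IN (
        SELECT S.id
        FROM sales S
        INNER JOIN (
            SELECT saleprice, saledate
            FROM sales
            GROUP BY saleprice, saledate
            HAVING COUNT(*) = 1
        ) T
        ON S.saleprice = T.saleprice AND S.saledate = T.saledate
    )
""")

active_ids = [row[0] for row in conn.execute(
    "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id")]
print(active_ids)
```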
GRAY°灰色天空 2024-07-12 03:57:37

If you put together the answers so far, clean up and improve, you would arrive at this superior query:

UPDATE sales
SET    status = 'ACTIVE'
WHERE  (saleprice, saledate) IN (
    SELECT saleprice, saledate
    FROM   sales
    GROUP  BY saleprice, saledate
    HAVING count(*) = 1 
    );

This is much faster than either of them. It beat the currently accepted answer by a factor of 10 to 15 in my tests on PostgreSQL 8.4 and 9.1.
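The row-value form of IN can be tried out with the sketch below. It uses Python's sqlite3 module rather than PostgreSQL (an assumption; SQLite accepts row values from version 3.15 on), and the sample rows are invented. It only checks which rows are selected and says nothing about the timings quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Hypothetical data: ids 1 and 2 are duplicates on (saleprice, saledate).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, 'NEW')",
    [(1, 10, "2024-01-01"), (2, 10, "2024-01-01"),
     (3, 10, "2024-01-02"), (4, 20, "2024-01-01")],
)

# Compare the column pair directly against the set of pairs that occur once.
conn.execute("""
    UPDATE sales
    SET    status = 'ACTIVE'
    WHERE  (saleprice, saledate) IN (
        SELECT saleprice, saledate
        FROM   sales
        GROUP  BY saleprice, saledate
        HAVING count(*) = 1
    )
""")

active_ids = [row[0] for row in conn.execute(
    "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id")]
print(active_ids)
```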

But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:

UPDATE sales s
SET    status = 'ACTIVE'
WHERE  NOT EXISTS (
   SELECT FROM sales s1                     -- SELECT list can be empty for EXISTS
   WHERE  s.saleprice = s1.saleprice
   AND    s.saledate  = s1.saledate
   AND    s.id <> s1.id                     -- except for row itself
   )
AND    s.status IS DISTINCT FROM 'ACTIVE';  -- avoid empty updates. see below
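A minimal sketch of the NOT EXISTS variant, again run through Python's sqlite3 module as a stand-in (an assumption). Two things differ from the PostgreSQL statement above and are choices made for this sketch: the subquery uses SELECT 1 because SQLite does not accept an empty SELECT list, and the outer table is referenced by its own name instead of an alias. The status filter is left out here to keep the focus on the anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Hypothetical data: ids 1 and 2 are duplicates on (saleprice, saledate).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, 'NEW')",
    [(1, 10, "2024-01-01"), (2, 10, "2024-01-01"),
     (3, 10, "2024-01-02"), (4, 20, "2024-01-01")],
)

# A row qualifies when no OTHER row has the same price and date.
conn.execute("""
    UPDATE sales
    SET    status = 'ACTIVE'
    WHERE  NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE  s1.saleprice = sales.saleprice
        AND    s1.saledate  = sales.saledate
        AND    s1.id <> sales.id
    )
""")

active_ids = [row[0] for row in conn.execute(
    "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id")]
print(active_ids)
```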

Unique key to identify row

If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):

   AND    s1.ctid <> s.ctid

Every table should have a primary key. Add one if you don't have one yet. I suggest a serial or an IDENTITY column in Postgres 10+.

How is this faster?

The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.

Exclude empty updates

For rows that already have status = 'ACTIVE' this update would not change anything, but still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition like the one demonstrated above to avoid this and make it even faster.

If status is defined NOT NULL, you can simplify to:

AND status <> 'ACTIVE';

The data type of the column must support the <> operator. Some types like json don't.
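The effect of the extra condition can be observed by counting touched rows. The sketch below uses Python's sqlite3 module, whose cursor.rowcount reports every row an UPDATE matched, whether or not the value changed. SQLite does not have PostgreSQL's row versions, so this only illustrates how many rows get rewritten, under the assumption that status is declared NOT NULL. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT NOT NULL)"
)
# Hypothetical data: the two unique rows (ids 3 and 4) are ALREADY active.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, 10, "2024-01-01", "NEW"), (2, 10, "2024-01-01", "NEW"),
     (3, 10, "2024-01-02", "ACTIVE"), (4, 20, "2024-01-01", "ACTIVE")],
)

base_update = """
    UPDATE sales
    SET    status = 'ACTIVE'
    WHERE  NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE  s1.saleprice = sales.saleprice
        AND    s1.saledate  = sales.saledate
        AND    s1.id <> sales.id
    )
"""

# Without the filter, both already-active rows are written again.
touched_without_filter = conn.execute(base_update).rowcount
# With the filter, nothing is left to update.
touched_with_filter = conn.execute(
    base_update + " AND status <> 'ACTIVE'").rowcount

print(touched_without_filter, touched_with_filter)
```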

Subtle difference in NULL handling

This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):

(123, NULL)
(123, NULL)

Such rows also pass a unique index and almost any other check, since NULL values do not compare equal according to the SQL standard.

OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal.

If all columns being compared are defined NOT NULL, there is no room for disagreement.
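The difference can be reproduced with the sketch below, which uses Python's sqlite3 module as a stand-in (an assumption). SQLite's IS operator serves here as the null-safe comparison, playing the role that IS NOT DISTINCT FROM plays in PostgreSQL. The three rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT)"
)
# Rows 1 and 2 look identical but have a NULL saledate.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 123, None), (2, 123, None), (3, 5, "2024-01-01")],
)

# GROUP BY puts the two NULL rows into one group of size 2.
group_counts = conn.execute(
    "SELECT saleprice, saledate, COUNT(*) FROM sales "
    "GROUP BY saleprice, saledate ORDER BY saleprice"
).fetchall()

unique_query = """
    SELECT s.id FROM sales s
    WHERE  NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE  s.saleprice = s1.saleprice
        AND    s.saledate {op} s1.saledate
        AND    s.id <> s1.id
    )
    ORDER BY s.id
"""
# '=' never matches NULL against NULL, so rows 1 and 2 count as unique.
unique_with_equals = [r[0] for r in conn.execute(unique_query.format(op="="))]
# 'IS' treats two NULLs as the same value, so only row 3 is unique.
unique_null_safe = [r[0] for r in conn.execute(unique_query.format(op="IS"))]

print(group_counts, unique_with_equals, unique_null_safe)
```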

对你而言 2024-07-12 03:57:37

The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.

This should work however (and doesn't need a join):

UPDATE sales
SET status='ACTIVE'
WHERE id IN (
  SELECT MIN(id) FROM sales
  GROUP BY saleprice, saledate
  HAVING COUNT(id) = 1
)

You could also use MAX or AVG instead of MIN; the only requirement is a function that returns the value of the column when there is only one matching row.
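A small sketch of the MIN(id) approach, run from Python against an in-memory SQLite table (SQLite and the sample rows are assumptions for illustration). Because HAVING keeps only groups of size one, MIN(id) is simply the id of that single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Hypothetical data: ids 1 and 2 are duplicates on (saleprice, saledate).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, 'NEW')",
    [(1, 10, "2024-01-01"), (2, 10, "2024-01-01"),
     (3, 10, "2024-01-02"), (4, 20, "2024-01-01")],
)

# Ids picked by the aggregate subquery on its own.
picked_ids = [row[0] for row in conn.execute("""
    SELECT MIN(id) FROM sales
    GROUP BY saleprice, saledate
    HAVING COUNT(id) = 1
    ORDER BY 1
""")]

conn.execute("""
    UPDATE sales
    SET status = 'ACTIVE'
    WHERE id IN (
        SELECT MIN(id) FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(id) = 1
    )
""")
active_ids = [row[0] for row in conn.execute(
    "SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id")]
print(picked_ids, active_ids)
```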

榆西 2024-07-12 03:57:37

If your DBMS doesn't support distinct with multiple columns like this:

select distinct(col1, col2) from table

In general, a multi-column select can be executed safely as follows:

select distinct * from (select col1, col2 from table ) as x

This works on most DBMSs and is expected to be faster than the GROUP BY solution, as you are avoiding the grouping functionality.
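The subquery form can be checked with the sketch below, which uses Python's sqlite3 module as a stand-in DBMS (an assumption). The table name tbl, the extra column col3, and the rows are hypothetical. It only confirms that the wrapped query returns the same pairs as a plain multi-column DISTINCT; it does not measure speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, col2 INTEGER, col3 TEXT)")
# Hypothetical rows: ('a', 1) appears twice with different col3 values.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [("a", 1, "x"), ("a", 1, "y"), ("b", 2, "z")],
)

# DISTINCT over everything the inner query projects.
wrapped = conn.execute(
    "SELECT DISTINCT * FROM (SELECT col1, col2 FROM tbl) AS x "
    "ORDER BY col1, col2"
).fetchall()
# Plain multi-column DISTINCT for comparison.
plain = conn.execute(
    "SELECT DISTINCT col1, col2 FROM tbl ORDER BY col1, col2"
).fetchall()

print(wrapped, wrapped == plain)
```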

因为看清所以看轻 2024-07-12 03:57:37

I want to select the distinct values from one column 'GrondOfLucht', but they should be sorted in the order given by the column 'sortering'. I cannot get the distinct values of just one column using:

Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering

That query also returns the column 'sortering', and because DISTINCT applies to the combination of 'GrondOfLucht' and 'sortering', the result can be ALL rows.

Use GROUP BY on 'GrondOfLucht' alone to select its distinct values in the order given by 'sortering':

SELECT   GrondOfLucht
FROM     dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht
ORDER BY MIN(sortering)
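The sketch below compares grouping by both columns with grouping by 'GrondOfLucht' alone, using Python's sqlite3 module as a stand-in for the poster's DBMS (an assumption, so the dbo. schema prefix is dropped). The four rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CorWijzeVanAanleg (GrondOfLucht TEXT, sortering INTEGER)")
# Hypothetical rows: each GrondOfLucht value occurs with two sortering values.
conn.executemany(
    "INSERT INTO CorWijzeVanAanleg VALUES (?, ?)",
    [("Lucht", 1), ("Grond", 2), ("Lucht", 3), ("Grond", 4)],
)

# Grouping by both columns keeps one row per (GrondOfLucht, sortering) pair.
by_both = [r[0] for r in conn.execute(
    "SELECT GrondOfLucht FROM CorWijzeVanAanleg "
    "GROUP BY GrondOfLucht, sortering ORDER BY MIN(sortering)")]
# Grouping by GrondOfLucht alone yields each value once,
# ordered by its smallest sortering.
by_one = [r[0] for r in conn.execute(
    "SELECT GrondOfLucht FROM CorWijzeVanAanleg "
    "GROUP BY GrondOfLucht ORDER BY MIN(sortering)")]

print(by_both, by_one)
```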