使用 GROUP BY 与 DISTINCT 时的巨大性能差异

发布于 2024-12-12 20:10:13 字数 2951 浏览 5 评论 0原文

我正在 HSQLDB 服务器上执行一些测试,其中的表包含 500 000 个条目。该表没有索引。有 5000 个不同的业务密钥。我需要他们的名单。

自然地,我从 DISTINCT 查询开始:

SELECT DISTINCT business_key
FROM memory
WHERE concept <> 'case'   OR 
      attrib  <> 'status' OR 
      value   <> 'closed';

大约需要 90 秒!

然后我尝试使用GROUP BY

SELECT business_key
FROM memory
WHERE concept <> 'case'   OR 
      attrib  <> 'status' OR
      value   <> 'closed';
GROUP BY business_key

并且需要1秒!

为了找出差异,我运行了 EXLAIN PLAN FOR 但它似乎为两个查询提供了相同的信息。

EXLAIN PLAN FOR DISTINCT ...

isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

编辑

我做了额外的测试。由于 HSQLDB 中有 500 000 条记录且具有所有不同的业务键,DISTINCT 的性能现在更好 - 3 秒,而 GROUP BY 则需要大约9秒。

在 MySQL 中,两个查询执行相同的操作:

MySQL:500 000 行 - 5 000 个不同的业务键: 两个查询:0.5 秒 MySQL:500 000 行 - 所有不同的业务键: 选择不同... - 11 秒 SELECT ... GROUP BY Business_key - 13 秒

因此问题仅与 HSQLDB 有关。

如果有人能解释为什么会有如此巨大的差异,我将非常感激。

I am performing some tests on a HSQLDB server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them.

Naturally I started with a DISTINCT query:

SELECT DISTINCT business_key
FROM memory
WHERE concept <> 'case'   OR 
      attrib  <> 'status' OR 
      value   <> 'closed';

It takes around 90 seconds!!!

Then I tried using GROUP BY:

SELECT business_key
FROM memory
WHERE concept <> 'case'   OR 
      attrib  <> 'status' OR
      value   <> 'closed';
GROUP BY business_key

And it takes 1 second!!!

Trying to figure out the difference I ran EXLAIN PLAN FOR but it seems to give the same information for both queries.

EXLAIN PLAN FOR DISTINCT ...

isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EDIT

I did additional tests. With 500 000 records in HSQLDB with all distinct business keys, the performance of DISTINCT is now better - 3 seconds, vs GROUP BY which took around 9 seconds.

In MySQL both queries preform the same:

MySQL: 500 000 rows - 5 000 distinct business keys:
Both queries: 0.5 second
MySQL: 500 000 rows - all distinct business keys:
SELECT DISTINCT ... - 11 seconds
SELECT ... GROUP BY business_key - 13 seconds

So the problem is only related to HSQLDB.

I will be very grateful if someone can explain why there is such a drastic difference.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

花开雨落又逢春i 2024-12-19 20:10:13

这两个查询表达了相同的问题。显然查询优化器选择了两个不同的执行计划。我的猜测是,distinct 方法的执行方式如下:

  • 将所有 business_key 值复制到临时表中
  • 对临时表进行排序
  • 扫描临时表,返回与以下内容不同的每个项目 扫描

group by 的执行方式如下:

  • 全表,将 business key 的每个值存储在哈希表中
  • 返回哈希表的键

第一种方法进行优化对于内存使用:当一部分临时表必须被换出。第二种方法优化了速度,但如果有很多不同的键,则可能需要大量内存。

由于您有足够的内存或很少有不同的密钥,因此第二种方法的性能优于第一种方法。两个执行计划之间出现 10 倍甚至 100 倍的性能差异并不罕见。

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • Copy all business_key values to a temporary table
  • Sort the temporary table
  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business key in a hashtable
  • Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文