当相同的查询加入完全相同但日期时,如何提高针对数据保险库体系结构的查询性能?

发布于 2025-02-09 04:18:22 字数 3297 浏览 2 评论 0原文

我们需要在公共位置,项目和表单上获取最小和最大日期。

此类查询的伪代码看起来像:

SELECT A.FORM, 
A.LOCATION,
A.PROJECT,
MN.START_DATE,
MX.END_DATE

FROM

(
) A

JOIN (
SELECT ...
FROM 
-- JOINS ON N TABLES

) MN ON MN.FORM = A.FORM AND MN.LOCATION = A.LOCATION AND MN.PROJECT = A.PROJECT

JOIN (
-- SAME JOINS ON SAME N TABLES

) MX ON MX.FORM = A.FORM AND MX.LOCATION = A.LOCATION AND MX.PROJECT = A.PROJECT

MN和MX的查询完全相同,但日期字段,而最小值(日期)来自MN,Max(Date)来自MX JOIN。

是否可以将两个MN组合到MX中,以减少执行时间并提高性能,因为它正在运行数千行数据。

这是确切的查询:

SELECT 
MX.ACTIVITY_CODE,
MN.FORM_ID_STRING, 
MX.PROJECT_CODE, 
DATE(MN.MIN_START_DATE) AS MIN_START_DATE, 
DATE(MX.MAX_END_DATE) AS MAX_END_DATE

FROM
(
  -- Getting the min start date of submissions related to same location, project and activity
  SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
  MIN(SO.OBSERVATION_START_DT) MIN_START_DATE
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
  WHERE 
  SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
  AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)

  AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
  GROUP BY  HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MN
JOIN
(
  -- Getting the max end date of submissions related to same location, project and activity
  SELECT HO.FORM_ID_STRING as form, HO.LOCATION_NAME as loc, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
  MAX(SO.OBSERVATION_START_DT) as MAX_END_DATE
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
  WHERE 
//  HO.FORM_ID_STRING = 'Global_rh_skilled_facility_delivery_v1_0_1_data.csv'
  SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
  AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)

  AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
  GROUP BY  HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MX 
ON MX.form = MN.form_id_string AND MX.loc = MN.location_name AND MN.PROJECT_CODE = MX.PROJECT_CODE

We need to get the minimum and maximum dates over common location, project and form.

the pseudo-code of such query looks like:

SELECT A.FORM, 
A.LOCATION,
A.PROJECT,
MN.START_DATE,
MX.END_DATE

FROM

(
) A

JOIN (
SELECT ...
FROM 
-- JOINS ON N TABLES

) MN ON MN.FORM = A.FORM AND MN.LOCATION = A.LOCATION AND MN.PROJECT = A.PROJECT

JOIN (
-- SAME JOINS ON SAME N TABLES

) MX ON MX.FORM = A.FORM AND MX.LOCATION = A.LOCATION AND MX.PROJECT = A.PROJECT

query of MN and MX are exactly the same but the date field whereas MIN(DATE) is coming from MN, and MAX(DATE) is coming from MX join.

is it possible to combine both MN into MX in a way to reduce execution time and increase performance as this is running over thousands of rows of data.

here is the exact query:

SELECT 
MX.ACTIVITY_CODE,
MN.FORM_ID_STRING, 
MX.PROJECT_CODE, 
DATE(MN.MIN_START_DATE) AS MIN_START_DATE, 
DATE(MX.MAX_END_DATE) AS MAX_END_DATE

FROM
(
  -- Getting the min start date of submissions related to same location, project and activity
  SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
  MIN(SO.OBSERVATION_START_DT) MIN_START_DATE
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
  WHERE 
  SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
  AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)

  AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
  GROUP BY  HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MN
JOIN
(
  -- Getting the max end date of submissions related to same location, project and activity
  SELECT HO.FORM_ID_STRING as form, HO.LOCATION_NAME as loc, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
  MAX(SO.OBSERVATION_START_DT) as MAX_END_DATE
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
  WHERE 
//  HO.FORM_ID_STRING = 'Global_rh_skilled_facility_delivery_v1_0_1_data.csv'
  SO.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)
  AND SOR.LOAD_DT = (SELECT MAX(LOAD_DT) FROM DATA_VAULT.SAT_OBSERVATION_REVIEW WHERE OBSERVATION_HKEY = HO.OBSERVATION_HKEY)

  AND SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
  GROUP BY  HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE
) MX 
ON MX.form = MN.form_id_string AND MX.loc = MN.location_name AND MN.PROJECT_CODE = MX.PROJECT_CODE

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

一生独一 2025-02-16 04:18:22

@HoneyBadger是正确的。您可以在同一子查询中获取最小和最大启动日期,因此您无需重复。这应该节省一些时间,IO和CPU。

SELECT ...,
       MIN(SO.OBSERVATION_START_DT) MIN_START_DATE,
       MAX(SO.OBSERVATION_START_DT) MAX_END_DATE

而且,您在子查询的Where子句中有几个相关的子查询。您可以将这两个更改为单个加入子查询,并节省更多时间。

这是您的单个大子查询的样子,这两个更改。 我无法调试此问题,因为我没有您的数据,并且只能猜测您的规格。

SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
       MIN(SO.OBSERVATION_START_DT) MIN_START_DATE,  /* min and max in */
       MAX(SO.OBSERVATION_START_DT) MAX_END_DATE     /* just one query */
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN (        /* non-correlated subquery to find max LOAD_DT values */
                SELECT OBSERVATION_HKEY, MAX(LOAD_DT) LOAD_DT
                  FROM DATA_VAULT.SAT_OBSERVATION
                 GROUP BY OBSERVATION_HKEY
   ) SOMAX   ON SO.LOAD_DT = SOMAX.LOAD_DT
            AND SO.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
            AND SOR.LOAD_DT = SOMAX.LOAD_DT
            AND SOR.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
 WHERE SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
 GROUP BY HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE

下一步,如果这没有足够的性能,就是查看执行计划(解释输出)并尝试弄清楚更多索引或不同的索引是否会有所帮助。正确的索引搜索最小值和最大值可能会很快。

这是另一个问题的好主题。

@HoneyBadger is correct. You can fetch both the MIN and MAX start dates in the same subquery, so you don't need to repeat it. That should save some time, IO, and CPU.

SELECT ...,
       MIN(SO.OBSERVATION_START_DT) MIN_START_DATE,
       MAX(SO.OBSERVATION_START_DT) MAX_END_DATE

And, you have a couple of correlated subqueries in the WHERE clause of your subquery. You can change both of those to a single JOINed subquery and save some more time.

This is what your single big subquery might look like with those two changes. I cannot debug this because I don't have your data and can only guess at your spec.

SELECT HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE,
       MIN(SO.OBSERVATION_START_DT) MIN_START_DATE,  /* min and max in */
       MAX(SO.OBSERVATION_START_DT) MAX_END_DATE     /* just one query */
  FROM DATA_VAULT.HUB_OBSERVATION HO 
  JOIN DATA_VAULT.SAT_OBSERVATION SO ON HO.OBSERVATION_HKEY = SO.OBSERVATION_HKEY
  JOIN DATA_VAULT.SAT_OBSERVATION_REVIEW SOR ON SOR.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN (        /* non-correlated subquery to find max LOAD_DT values */
                SELECT OBSERVATION_HKEY, MAX(LOAD_DT) LOAD_DT
                  FROM DATA_VAULT.SAT_OBSERVATION
                 GROUP BY OBSERVATION_HKEY
   ) SOMAX   ON SO.LOAD_DT = SOMAX.LOAD_DT
            AND SO.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
            AND SOR.LOAD_DT = SOMAX.LOAD_DT
            AND SOR.OBSERVATION_HKEY = SOMAX.OBSERVATION_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_PROJECT_CODE LOPC ON LOPC.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_PROJECT_CODE HPC ON HPC.PROJECT_CODE_HKEY = LOPC.PROJECT_CODE_HKEY
  JOIN DATA_VAULT.LNK_OBSERVATION_COUNTRY_ACTIVITY LOCA ON LOCA.OBSERVATION_HKEY = HO.OBSERVATION_HKEY
  JOIN DATA_VAULT.HUB_ACTIVITY HA ON HA.ACTIVITY_HKEY = LOCA.ACTIVITY_HKEY
 WHERE SOR.REVIEW_STATUS NOT IN ('REJECTED', 'DELETED')
 GROUP BY HO.FORM_ID_STRING, HO.LOCATION_NAME, HPC.PROJECT_CODE, HA.ACTIVITY_CODE

The next step, if this doesn't give you enough performance, is to look at the execution plan (the EXPLAIN output) and try to work out whether some more indexes or different indexes will help. Searching for MIN and MAX values can be astonishingly fast with the right indexes.

That is a good subject for another question.

两仪 2025-02-16 04:18:22

尽可能获得最大和最小值,而无需加入。

如果可能的话,在同一查询中获取最大和最小值。

获取一次最大(load_dt),而不是两次。

然后加入所有其他表。

这些索引中的一些可能很有用:

HO:  INDEX(FORM_ID_STRING, OBSERVATION_HKEY,  LOCATION_NAME)
HPC:  INDEX(PROJECT_CODE_HKEY,  PROJECT_CODE)
HA:  INDEX(ACTIVITY_HKEY,  ACTIVITY_CODE)
SO:  INDEX(LOAD_DT, OBSERVATION_HKEY,  OBSERVATION_START_DT)
SOR:  INDEX(LOAD_DT, REVIEW_STATUS, OBSERVATION_HKEY)
LOPC:  INDEX(OBSERVATION_HKEY,  PROJECT_CODE_HKEY)
LOCA:  INDEX(OBSERVATION_HKEY,  ACTIVITY_HKEY)
HO:  INDEX(OBSERVATION_HKEY)

为了进一步讨论,请提供为所有表显示创建表;听起来他们可能过于归一化。

Get the max and min without all those joins, if possible.

Get the max and min in the same query, if possible.

Get MAX(LOAD_DT) once, not twice.

Then JOIN to all the other tables.

Some of these indexes may be useful:

HO:  INDEX(FORM_ID_STRING, OBSERVATION_HKEY,  LOCATION_NAME)
HPC:  INDEX(PROJECT_CODE_HKEY,  PROJECT_CODE)
HA:  INDEX(ACTIVITY_HKEY,  ACTIVITY_CODE)
SO:  INDEX(LOAD_DT, OBSERVATION_HKEY,  OBSERVATION_START_DT)
SOR:  INDEX(LOAD_DT, REVIEW_STATUS, OBSERVATION_HKEY)
LOPC:  INDEX(OBSERVATION_HKEY,  PROJECT_CODE_HKEY)
LOCA:  INDEX(OBSERVATION_HKEY,  ACTIVITY_HKEY)
HO:  INDEX(OBSERVATION_HKEY)

For further discussion, please provide SHOW CREATE TABLE for all the tables; it sounds like they might be over-normalized.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文