删除多余的SQL代码

发布于 2024-08-31 12:16:38 字数 5545 浏览 7 评论 0原文

代码

以下代码计算针对大量数据的线性回归的斜率和截距。然后，它对同一结果集应用方程 y = mx + b 来计算每行的回归线值。

如何连接两个查询以便在不执行两次 WHERE 子句的情况下计算数据及其斜率/截距？

问题的一般形式是：

SELECT a.group, func(a.group, avg_avg)
FROM a
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (SOME_CONDITION)
           GROUP BY a.group) as several_lines -- potentially
    ) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines

我有 SOME_CONDITION 执行了两次。如下所示（使用 STRAIGHT_JOIN 优化进行更新）：

SELECT STRAIGHT_JOIN
  AVG(D.AMOUNT) as AMOUNT,
  Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
  Y.YEAR as YEAR,
  MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
  ymxb.SLOPE,
  ymxb.INTERCEPT,
  ymxb.CORRELATION,
  ymxb.MEASUREMENTS
FROM
  CITY C,
  STATION S,
  STATION_DISTRICT SD,
  YEAR_REF Y,
  MONTH_REF M,
  DAILY D,
  (SELECT
    SUM(MEASUREMENTS) as MEASUREMENTS,

    ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

    ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
    (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,

    ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
    (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
  FROM (
    SELECT STRAIGHT_JOIN
      COUNT(1) as MEASUREMENTS,
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR as YEAR
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D
    WHERE
      -- For a specific city ...
      --
      $X{ IN, C.ID, CityCode } AND

      -- Find all the stations within a specific unit radius ...
      --
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND

      SD.ID = S.STATION_DISTRICT_ID AND

      -- Gather all known years for that station ...
      --
      Y.STATION_DISTRICT_ID = SD.ID AND

      -- The data before 1900 is shaky; insufficient after 2009.
      --
      Y.YEAR BETWEEN 1900 AND 2009 AND

      -- Filtered by all known months ...
      --
      M.YEAR_REF_ID = Y.ID AND

      -- Whittled down by category ...
      --
      M.CATEGORY_ID = $P{CategoryCode} AND

      -- Into the valid daily climate data.
      --
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
  ) t
) ymxb
WHERE
  -- For a specific city ...
  --
  $X{ IN, C.ID, CityCode } AND

  -- Find all the stations within a specific unit radius ...
  --
  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
     POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND

  SD.ID = S.STATION_DISTRICT_ID AND

  -- Gather all known years for that station ...
  --
  Y.STATION_DISTRICT_ID = SD.ID AND

  -- The data before 1900 is shaky; insufficient after 2009.
  --
  Y.YEAR BETWEEN 1900 AND 2009 AND

  -- Filtered by all known months ...
  --
  M.YEAR_REF_ID = Y.ID AND

  -- Whittled down by category ...
  --
  M.CATEGORY_ID = $P{CategoryCode} AND

  -- Into the valid daily climate data.
  --
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
GROUP BY
  Y.YEAR

问题

如何在每个查询中仅执行一次重复位，而不是两次？重复的代码：

  $X{ IN, C.ID, CityCode } AND
  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
     POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
  SD.ID = S.STATION_DISTRICT_ID AND
  Y.STATION_DISTRICT_ID = SD.ID AND
  Y.YEAR BETWEEN 1900 AND 2009 AND
  M.YEAR_REF_ID = Y.ID AND
  M.CATEGORY_ID = $P{CategoryCode} AND
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
GROUP BY
  Y.YEAR

更新 1

使用变量并拆分查询似乎允许缓存启动，因为现在运行时间为 3.5 秒，而过去运行时间为 7 秒。不过，如果有什么办法的话要删除重复的代码，我将不胜感激任何帮助。

<罢工> 更新 2

~~上面的代码不能在 JasperReports 中运行，而 VIEW 虽然是一个可能的修复方法，但可能效率极低（因为 WHERE 子句是参数化的）。~~

更新 3

使用 Unreason 建议的具有收敛子午线的毕达哥拉斯公式来验证距离：（

  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
    POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )

这与问题无关，但应该有人想知道......）

更新 4

如图所示，代码在 JasperReports 中运行，针对 MySQL 数据库运行。 JasperReports 不允许变量或多个查询。

更新 5

我正在寻找一个干净执行的解决方案。 ;-) 我已经编写了许多部分工作的解决方案，但遗憾的是 MySQL 无法理解部分正确。请参阅与 Unreason 的讨论以获得几乎有效的答案。

更新 6

我也许能够重用第一个 WHERE 子句中的变量，并将它们与第二个子句进行比较（从而消除一些重复 - 检查针对 $P{} 值），但我真的希望消除重复。

更新 7

按照先前更新中的假设，比较 YEAR 子句以消除重复的 BETWEEN 是行不通的。

相关

如何消除SQL中的重复计算？

谢谢！

原文

Code

The following code calculates the slope and intercept for a linear regression against a slathering of data. It then applies the equation y = mx + b against the same result set to calculate the value of the regression line for each row.

How can the two queries be joined so that the data and its slope/intercept are calculated without executing the WHERE clause twice?

The general form of the problem is:

SELECT a.group, func(a.group, avg_avg)
FROM a
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (SOME_CONDITION)
           GROUP BY a.group) as several_lines -- potentially
    ) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines

I have SOME_CONDITION executing twice. This is shown below (updated with a STRAIGHT_JOIN optimization):

SELECT STRAIGHT_JOIN
  AVG(D.AMOUNT) as AMOUNT,
  Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
  Y.YEAR as YEAR,
  MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
  ymxb.SLOPE,
  ymxb.INTERCEPT,
  ymxb.CORRELATION,
  ymxb.MEASUREMENTS
FROM
  CITY C,
  STATION S,
  STATION_DISTRICT SD,
  YEAR_REF Y,
  MONTH_REF M,
  DAILY D,
  (SELECT
    SUM(MEASUREMENTS) as MEASUREMENTS,

    ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

    ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
    (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,

    ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
    (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
  FROM (
    SELECT STRAIGHT_JOIN
      COUNT(1) as MEASUREMENTS,
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR as YEAR
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D
    WHERE
      -- For a specific city ...
      --
      $X{ IN, C.ID, CityCode } AND

      -- Find all the stations within a specific unit radius ...
      --
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND

      SD.ID = S.STATION_DISTRICT_ID AND

      -- Gather all known years for that station ...
      --
      Y.STATION_DISTRICT_ID = SD.ID AND

      -- The data before 1900 is shaky; insufficient after 2009.
      --
      Y.YEAR BETWEEN 1900 AND 2009 AND

      -- Filtered by all known months ...
      --
      M.YEAR_REF_ID = Y.ID AND

      -- Whittled down by category ...
      --
      M.CATEGORY_ID = $P{CategoryCode} AND

      -- Into the valid daily climate data.
      --
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
  ) t
) ymxb
WHERE
  -- For a specific city ...
  --
  $X{ IN, C.ID, CityCode } AND

  -- Find all the stations within a specific unit radius ...
  --
  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
     POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND

  SD.ID = S.STATION_DISTRICT_ID AND

  -- Gather all known years for that station ...
  --
  Y.STATION_DISTRICT_ID = SD.ID AND

  -- The data before 1900 is shaky; insufficient after 2009.
  --
  Y.YEAR BETWEEN 1900 AND 2009 AND

  -- Filtered by all known months ...
  --
  M.YEAR_REF_ID = Y.ID AND

  -- Whittled down by category ...
  --
  M.CATEGORY_ID = $P{CategoryCode} AND

  -- Into the valid daily climate data.
  --
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
GROUP BY
  Y.YEAR

Question

How do I execute the duplicate bits only once per query, instead of twice? The duplicate code:

  $X{ IN, C.ID, CityCode } AND
  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
     POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
  SD.ID = S.STATION_DISTRICT_ID AND
  Y.STATION_DISTRICT_ID = SD.ID AND
  Y.YEAR BETWEEN 1900 AND 2009 AND
  M.YEAR_REF_ID = Y.ID AND
  M.CATEGORY_ID = $P{CategoryCode} AND
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
GROUP BY
  Y.YEAR

Update 1

Using variables and splitting the query seems to allow the cache to kick in as this now runs in 3.5 seconds, whereas it used to run in 7. Still, if there is any way to remove the duplicate code, I'd be grateful for any help.

~~Update 2~~

The above code does not run in JasperReports, and a VIEW, while a possible fix, would probably be extremely inefficient (because the WHERE clauses are parameterized).

Update 3

Validating distance using Unreason's suggestion of the Pythagorean formula with converging meridians:

  6371.009 *
  SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
    POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )

(This is unrelated to the question, but should someone else want to know ...)

Update 4

The code, as shown, works in JasperReports, running against a MySQL database. JasperReports does not allow variables or multiple queries.

Update 5

Am looking for a solution that executes cleanly. ;-) I have written a number of partially working solutions, but MySQL, sadly, does not understand partially correct. See the discussions with Unreason for answers that almost work.

Update 6

I might be able to reuse variables from the first WHERE clause and compare them to the second (thereby eliminating some duplication -- the checks against $P{} values), but I'd really like the duplication eliminated.

Update 7

Comparing the YEAR clause, as hypothesized in the previous update, to eliminate the duplicate BETWEEN, does not work.

Thank you!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

看春风乍起 2024-09-07 12:16:38

您应该能够一次性获得所需的一切：

 SELECT
    AVG(D.AMOUNT) as AMOUNT,
    Y.YEAR as YEAR,
    MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
    Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,             
    ((avg(AVG(D.AMOUNT) * Y.YEAR)) - avg(AVG(D.AMOUNT)) * avg(Y.YEAR)) /                  
    (stddev( AVG(D.AMOUNT) ) * stddev( Y.YEAR )) as CORRELATION,                     
    ((sum(Y.YEAR) * sum(AVG(D.AMOUNT))) - (count(1) * sum(Y.YEAR * AVG(D.AMOUNT)))) /
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as SLOPE,   
    ((sum( Y.YEAR ) * sum( Y.YEAR * AVG(D.AMOUNT) )) -
    (sum( AVG(D.AMOUNT) ) * sum(power(Y.YEAR, 2)))) / 
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as INTERCEPT
 FROM
    CITY C,
    STATION S,
    YEAR_REF Y,
    MONTH_REF M,
    DAILY D
 WHERE
    $X{ IN, C.ID, CityCode } AND
    SQRT(
        POW( C.LATITUDE - S.LATITUDE, 2 ) +
        POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} AND
    S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
    Y.YEAR BETWEEN 1900 AND 2009 AND
    M.YEAR_REF_ID = Y.ID AND
    M.CATEGORY_ID = $P{CategoryCode} AND
    M.ID = D.MONTH_REF_ID AND
    D.DAILY_FLAG_ID <> 'M'
 GROUP BY
    Y.YEAR

直接从上面的查询中将无法工作（它具有无意义的组合聚合和其他错误）；这可能是检查公式的好时机

如果您决定进行子查询并简化公式，那么：

您可以抓取（确实抓取）最内部查询中的所有必要数据，而不必重复所有操作外部查询中的表不再存在（只需从 t 中选择相关列，它们已经可供您使用）
您不必重复 where 条件

You should be able to get everything you need in one go:

 SELECT
    AVG(D.AMOUNT) as AMOUNT,
    Y.YEAR as YEAR,
    MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
    Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,             
    ((avg(AVG(D.AMOUNT) * Y.YEAR)) - avg(AVG(D.AMOUNT)) * avg(Y.YEAR)) /                  
    (stddev( AVG(D.AMOUNT) ) * stddev( Y.YEAR )) as CORRELATION,                     
    ((sum(Y.YEAR) * sum(AVG(D.AMOUNT))) - (count(1) * sum(Y.YEAR * AVG(D.AMOUNT)))) /
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as SLOPE,   
    ((sum( Y.YEAR ) * sum( Y.YEAR * AVG(D.AMOUNT) )) -
    (sum( AVG(D.AMOUNT) ) * sum(power(Y.YEAR, 2)))) / 
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as INTERCEPT
 FROM
    CITY C,
    STATION S,
    YEAR_REF Y,
    MONTH_REF M,
    DAILY D
 WHERE
    $X{ IN, C.ID, CityCode } AND
    SQRT(
        POW( C.LATITUDE - S.LATITUDE, 2 ) +
        POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} AND
    S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
    Y.YEAR BETWEEN 1900 AND 2009 AND
    M.YEAR_REF_ID = Y.ID AND
    M.CATEGORY_ID = $P{CategoryCode} AND
    M.ID = D.MONTH_REF_ID AND
    D.DAILY_FLAG_ID <> 'M'
 GROUP BY
    Y.YEAR

The things will not work straight from the query above (it has nonsensically combined aggregates and other errors); this can be a good time to check your formulas

If you decide to do sub queries do simplify the formulas, then:

you can grab (you do grab) all the necessary data in the inner most query and you don't have to repeat all the tables in the outer queries any more (just select the relevant columns from the t, they are already at your disposal)
you don't have to repeat the where condition

回复收藏 0 原文

回眸一遍 2024-09-07 12:16:38

这个问题比你的概括要困难一些。我将其表述如下：

SELECT a.group, func(a.group, avg_avg)
FROM a
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (YOUR_CONDITION)
           GROUP BY a.group) as several_lines -- potentially
    ) as one_line -- always
WHERE (YOUR_CONDITION)
GROUP BY a.group -- again, potentially several lines

您有一个数据子集（受您的条件限制），它被分组并为每个组进行聚合。然后，您将聚合向下合并为单个值，并希望再次将该值的函数应用于每个组。显然，在分组子查询的结果可以作为实体引用之前，您不能重用条件。

在 MSSQL 和 Oracle 中，您将使用 WITH 运算符。在 MySQL 中，唯一的选择是使用临时表。我假设你的报告中有一年以上的内容（否则查询会简单得多）。

UPD：抱歉，我现在无法发布准备好的代码（明天可以），但我有一个想法：

您可以将子查询中需要输出的数据与 连接起来GROUP_CONCAT 并使用 FIND_IN_SET 和 SUBSTRING_INDEX 函数将其拆分回外部查询中。外部查询将仅 JOIN YEAR_REF 和聚合结果。

外部查询中的条件将只是 WHERE FIND_IN_SET(year, concatenated_years)。

UPD：

这是使用GROUP_CONCAT将所需数据传递到外部JOIN的版本。

我的评论以 --newtover: 开头。顺便说一句，1）我认为 STRAIGHT_JOIN 不会带来任何好处，2）COUNT(*) 在 MySQL 中具有特殊含义，当您想要时应该使用计算行数。

SELECT STRAIGHT_JOIN
  -- newtover: extract the corresponding amount back
  SUBSTRING_INDEX(SUBSTRING_INDEX(GROUPED_AMOUNTS, '|', @pos),'|', -1) as AMOUNT,
  Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
  Y.YEAR as YEAR,
  MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
  ymxb.SLOPE,
  ymxb.INTERCEPT,
  ymxb.CORRELATION,
  ymxb.MEASUREMENTS
FROM
  -- newtover: list of tables now contains only the subquery, YEAR_REF for grouping and init_vars to define the variable
  YEAR_REF Y,
  (SELECT
    SUM(MEASUREMENTS) as MEASUREMENTS,
    ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
    ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
    (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
    ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
    (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION,
    -- newtover: grouped fields for matching years and the corresponding amounts
    GROUP_CONCAT(Y.YEAR) as GROUPED_YEARS,
    GROUP_CONCAT(AMOUNT SEPARATOR '|') as GROUPED_AMOUNTS
  FROM (
    SELECT STRAIGHT_JOIN
      COUNT(1) as MEASUREMENTS,
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR as YEAR
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D
    WHERE
      -- For a specific city ...
      $X{ IN, C.ID, CityCode } AND
      -- Find all the stations within a specific unit radius ...
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
      SD.ID = S.STATION_DISTRICT_ID AND
      -- Gather all known years for that station ...
      Y.STATION_DISTRICT_ID = SD.ID AND
      -- The data before 1900 is shaky; insufficient after 2009.
      Y.YEAR BETWEEN 1900 AND 2009 AND
      -- Filtered by all known months ...
      M.YEAR_REF_ID = Y.ID AND
      -- Whittled down by category ...
      M.CATEGORY_ID = $P{CategoryCode} AND
      -- Into the valid daily climate data.
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
  ) t
) ymxb,
(SELECT @pos:=NULL) as init_vars
WHERE
    -- newtover: check if the year is in the list and store the index into the variable
    @pos:=CAST(FIND_IN_SET(Y.YEAR, GROUPED_YEARS) as UNSIGNED)
GROUP BY
  Y.YEAR

The problem is a bit more difficult than in your generalization. I would state it as the following:

SELECT a.group, func(a.group, avg_avg)
FROM a
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (YOUR_CONDITION)
           GROUP BY a.group) as several_lines -- potentially
    ) as one_line -- always
WHERE (YOUR_CONDITION)
GROUP BY a.group -- again, potentially several lines

You have a subset of data (limited by your condition), which is grouped and an aggregation is made for each group. Then, you merge down aggregations to a single value and you want to apply a function of the value to each group again. Obviously, you can not reuse the condition until the result of the grouped subquery can be referenced as an entity.

In MSSQL and Oracle, you would use WITH operator. In MySQL the only option is to use a temporary table. I assume that there is more than one year in your report (otherwise, the query would be much simplier).

UPD: I am sorry, I can not post the ready code now (can do it tomorrow), but I have an idea:

You can concatenate the data you need to output in the subquery with GROUP_CONCAT AND split it back in the outer query with FIND_IN_SET, and SUBSTRING_INDEX functions. the outer query will JOIN only the YEAR_REF and the result of the aggregation.

The condition in the outer query then will be just WHERE FIND_IN_SET(year, concatenated_years).

UPD:

Here is the version that uses GROUP_CONCAT to pass the required data to the outer JOIN.

My comments start with --newtover:. By the way, 1) I do not think STRAIGHT_JOIN adds any benefit, and 2) COUNT(*) has a special meaning in MySQL and should be used when you want to count rows.

SELECT STRAIGHT_JOIN
  -- newtover: extract the corresponding amount back
  SUBSTRING_INDEX(SUBSTRING_INDEX(GROUPED_AMOUNTS, '|', @pos),'|', -1) as AMOUNT,
  Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
  Y.YEAR as YEAR,
  MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
  ymxb.SLOPE,
  ymxb.INTERCEPT,
  ymxb.CORRELATION,
  ymxb.MEASUREMENTS
FROM
  -- newtover: list of tables now contains only the subquery, YEAR_REF for grouping and init_vars to define the variable
  YEAR_REF Y,
  (SELECT
    SUM(MEASUREMENTS) as MEASUREMENTS,
    ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
    ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
    (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
    ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
    (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION,
    -- newtover: grouped fields for matching years and the corresponding amounts
    GROUP_CONCAT(Y.YEAR) as GROUPED_YEARS,
    GROUP_CONCAT(AMOUNT SEPARATOR '|') as GROUPED_AMOUNTS
  FROM (
    SELECT STRAIGHT_JOIN
      COUNT(1) as MEASUREMENTS,
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR as YEAR
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D
    WHERE
      -- For a specific city ...
      $X{ IN, C.ID, CityCode } AND
      -- Find all the stations within a specific unit radius ...
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
      SD.ID = S.STATION_DISTRICT_ID AND
      -- Gather all known years for that station ...
      Y.STATION_DISTRICT_ID = SD.ID AND
      -- The data before 1900 is shaky; insufficient after 2009.
      Y.YEAR BETWEEN 1900 AND 2009 AND
      -- Filtered by all known months ...
      M.YEAR_REF_ID = Y.ID AND
      -- Whittled down by category ...
      M.CATEGORY_ID = $P{CategoryCode} AND
      -- Into the valid daily climate data.
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
  ) t
) ymxb,
(SELECT @pos:=NULL) as init_vars
WHERE
    -- newtover: check if the year is in the list and store the index into the variable
    @pos:=CAST(FIND_IN_SET(Y.YEAR, GROUPED_YEARS) as UNSIGNED)
GROUP BY
  Y.YEAR

回复收藏 0 原文

再可℃爱ぅ一点好了 2024-09-07 12:16:38

由于问题中的 SQL 基本上已挂起（现在仅显示相关部分），这是我的新答案

假设：条件确实相同，并且子查询和外部查询之间没有发生棘手的列别名

答案：
您可以删除外部查询中的 where 。

SELECT
  /* aggregate data */
  ymxb.*
FROM (
  SELECT
    /* similar aggregate data */
  WHERE
    /* some condition */
  GROUP BY
    YEAR
) ymxb
GROUP BY
  YEAR

这应该会给你相同的结果。

（另请注意，您可以删除内部位置并保留外部位置 - 结果应该相同，但性能可能不同）。

最后，重复 where 子句可能不会对性能产生太大影响 - 与任何 I/O 相比，计算额外条件（甚至是 sqrt 等表达式）的成本非常低（并且这些条件不会影响性能）。对任何新列进行操作，因此所有 I/O 都已完成）

此外，您的内部查询和外部查询使用相同的 GROUP BY，并且外部查询从子查询获取所有数据。

这使得外部查询中的任何聚合函数变得多余（子查询中的行（作为外部查询的源）已按年份分组）。

这使得整个子选择变得多余。

As the SQL in the question was substantially hanged (now showing only relevant parts) here is my new answer

Assumption: The condition is really the same and no tricky column aliasing occurs between subquery and outer query

Answer:
You can remove the where in the outer query.

SELECT
  /* aggregate data */
  ymxb.*
FROM (
  SELECT
    /* similar aggregate data */
  WHERE
    /* some condition */
  GROUP BY
    YEAR
) ymxb
GROUP BY
  YEAR

This should give you the same result.

(Also note that you could have removed the inner where and kept the outer one - results should be the same, however performance might not).

Finally, repeating the where clause probably does not have big impact on the performance - evaluating extra conditions (even expressions such as sqrt, etc) is very cheap compared to any I/O (and these conditions do not operate on any new columns, so all I/O had already been done)

Furthermore, your inner query and outer query use the same GROUP BY and the outer query gets it all data from subquery.

This makes any aggregate functions in the outer query redundant (the rows from the subquery, which are the source for the outer query, have already been grouped by year).

That makes the whole subselect redundant.

回复收藏 0 原文

〆一缕阳光ご 2024-09-07 12:16:38

您可以在您的情况下使用临时表吗？尽管它仍然需要您使用 WHERE 子句两次，但它应该会显着提高您的性能。

DROP TEMPORARY TABLE IF EXISTS TEMP_DATA

CREATE TEMPORARY TABLE TEMP_DATA 
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (SOME_CONDITION)
           GROUP BY a.group)
    )

SELECT t.group, func(t.group, t.avg_avg)
FROM TEMP_DATA AS t
WHERE (SOME_CONDITION)
GROUP BY t.group

希望这有帮助！
——配音

Are you able to use a temp table in your situation? Although it still requires you to use the WHERE clause twice, it should considerably boost your performance.

DROP TEMPORARY TABLE IF EXISTS TEMP_DATA

CREATE TEMPORARY TABLE TEMP_DATA 
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (SOME_CONDITION)
           GROUP BY a.group)
    )

SELECT t.group, func(t.group, t.avg_avg)
FROM TEMP_DATA AS t
WHERE (SOME_CONDITION)
GROUP BY t.group

Hope this helps!
--Dubs

回复收藏 0 原文

~没有更多了~