删除多余的SQL代码
代码
以下代码计算针对大量数据的线性回归的斜率和截距。然后,它对同一结果集应用方程 y = mx + b 来计算每行的回归线值。
如何连接两个查询以便在不执行两次 WHERE
子句的情况下计算数据及其斜率/截距?
问题的一般形式是:
SELECT a.group, func(a.group, avg_avg)
FROM a
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (SOME_CONDITION)
GROUP BY a.group) as several_lines -- potentially
) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines
我有 SOME_CONDITION
执行了两次。如下所示(使用 STRAIGHT_JOIN
优化进行更新):
SELECT STRAIGHT_JOIN
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
ymxb.SLOPE,
ymxb.INTERCEPT,
ymxb.CORRELATION,
ymxb.MEASUREMENTS
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D,
(SELECT
SUM(MEASUREMENTS) as MEASUREMENTS,
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
(stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
SELECT STRAIGHT_JOIN
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
) t
) ymxb
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
问题
如何在每个查询中仅执行一次重复位,而不是两次?重复的代码:
$X{ IN, C.ID, CityCode } AND
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
Y.STATION_DISTRICT_ID = SD.ID AND
Y.YEAR BETWEEN 1900 AND 2009 AND
M.YEAR_REF_ID = Y.ID AND
M.CATEGORY_ID = $P{CategoryCode} AND
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
更新 1
使用变量并拆分查询似乎允许缓存启动,因为现在运行时间为 3.5 秒,而过去运行时间为 7 秒。不过,如果有什么办法的话要删除重复的代码,我将不胜感激任何帮助。
<罢工> 更新 2
上面的代码不能在 JasperReports 中运行,而 VIEW 虽然是一个可能的修复方法,但可能效率极低(因为 WHERE 子句是参数化的)。
更新 3
使用 Unreason 建议的具有收敛子午线的毕达哥拉斯公式来验证距离:(
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )
这与问题无关,但应该有人想知道......)
更新 4
如图所示,代码在 JasperReports 中运行,针对 MySQL 数据库运行。 JasperReports 不允许变量或多个查询。
更新 5
我正在寻找一个干净执行的解决方案。 ;-) 我已经编写了许多部分工作的解决方案,但遗憾的是 MySQL 无法理解部分正确。请参阅与 Unreason 的讨论以获得几乎有效的答案。
更新 6
我也许能够重用第一个 WHERE
子句中的变量,并将它们与第二个子句进行比较(从而消除一些重复 - 检查针对 $P{}
值),但我真的希望消除重复。
更新 7
按照先前更新中的假设,比较 YEAR
子句以消除重复的 BETWEEN
是行不通的。
相关
谢谢!
Code
The following code calculates the slope and intercept for a linear regression against a slathering of data. It then applies the equation y = mx + b
against the same result set to calculate the value of the regression line for each row.
How can the two queries be joined so that the data and its slope/intercept are calculated without executing the WHERE
clause twice?
The general form of the problem is:
SELECT a.group, func(a.group, avg_avg)
FROM a
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (SOME_CONDITION)
GROUP BY a.group) as several_lines -- potentially
) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines
I have SOME_CONDITION
executing twice. This is shown below (updated with a STRAIGHT_JOIN
optimization):
SELECT STRAIGHT_JOIN
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
ymxb.SLOPE,
ymxb.INTERCEPT,
ymxb.CORRELATION,
ymxb.MEASUREMENTS
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D,
(SELECT
SUM(MEASUREMENTS) as MEASUREMENTS,
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
(stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
SELECT STRAIGHT_JOIN
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
) t
) ymxb
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Question
How do I execute the duplicate bits only once per query, instead of twice? The duplicate code:
$X{ IN, C.ID, CityCode } AND
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
Y.STATION_DISTRICT_ID = SD.ID AND
Y.YEAR BETWEEN 1900 AND 2009 AND
M.YEAR_REF_ID = Y.ID AND
M.CATEGORY_ID = $P{CategoryCode} AND
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update 1
Using variables and splitting the query seems to allow the cache to kick in as this now runs in 3.5 seconds, whereas it used to run in 7. Still, if there is any way to remove the duplicate code, I'd be grateful for any help.
Update 2
The above code does not run in JasperReports, and a VIEW, while a possible fix, would probably be extremely inefficient (because the WHERE clauses are parameterized).
Update 3
Validating distance using Unreason's suggestion of the Pythagorean formula with converging meridians:
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )
(This is unrelated to the question, but should someone else want to know ...)
Update 4
The code, as shown, works in JasperReports, running against a MySQL database. JasperReports does not allow variables or multiple queries.
Update 5
Am looking for a solution that executes cleanly. ;-) I have written a number of partially working solutions, but MySQL, sadly, does not understand partially correct. See the discussions with Unreason for answers that almost work.
Update 6
I might be able to reuse variables from the first WHERE
clause and compare them to the second (thereby eliminating some duplication -- the checks against $P{}
values), but I'd really like the duplication eliminated.
Update 7
Comparing the YEAR
clause, as hypothesized in the previous update, to eliminate the duplicate BETWEEN
, does not work.
Related
How to eliminate duplicate calculation in SQL?
Thank you!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
您应该能够一次性获得所需的一切:
直接从上面的查询中将无法工作(它具有无意义的组合聚合和其他错误);这可能是检查公式的好时机
如果您决定进行子查询并简化公式,那么:
You should be able to get everything you need in one go:
The things will not work straight from the query above (it has nonsensically combined aggregates and other errors); this can be a good time to check your formulas
If you decide to do sub queries do simplify the formulas, then:
这个问题比你的概括要困难一些。我将其表述如下:
您有一个数据子集(受您的条件限制),它被分组并为每个组进行聚合。然后,您将聚合向下合并为单个值,并希望再次将该值的函数应用于每个组。显然,在分组子查询的结果可以作为实体引用之前,您不能重用条件。
在 MSSQL 和 Oracle 中,您将使用
WITH
运算符。在 MySQL 中,唯一的选择是使用临时表。我假设你的报告中有一年以上的内容(否则查询会简单得多)。UPD:抱歉,我现在无法发布准备好的代码(明天可以),但我有一个想法:
您可以将子查询中需要输出的数据与
连接起来GROUP_CONCAT
并使用FIND_IN_SET
和SUBSTRING_INDEX
函数将其拆分回外部查询中。外部查询将仅 JOIN YEAR_REF 和聚合结果。外部查询中的条件将只是
WHERE FIND_IN_SET(year, concatenated_years)
。UPD:
这是使用GROUP_CONCAT将所需数据传递到外部JOIN的版本。
我的评论以
--newtover:
开头。顺便说一句,1)我认为 STRAIGHT_JOIN 不会带来任何好处,2)COUNT(*)
在 MySQL 中具有特殊含义,当您想要时应该使用计算行数。The problem is a bit more difficult than in your generalization. I would state it as the following:
You have a subset of data (limited by your condition), which is grouped and an aggregation is made for each group. Then, you merge down aggregations to a single value and you want to apply a function of the value to each group again. Obviously, you can not reuse the condition until the result of the grouped subquery can be referenced as an entity.
In MSSQL and Oracle, you would use
WITH
operator. In MySQL the only option is to use a temporary table. I assume that there is more than one year in your report (otherwise, the query would be much simplier).UPD: I am sorry, I can not post the ready code now (can do it tomorrow), but I have an idea:
You can concatenate the data you need to output in the subquery with
GROUP_CONCAT
AND split it back in the outer query withFIND_IN_SET
, andSUBSTRING_INDEX
functions. the outer query will JOIN only the YEAR_REF and the result of the aggregation.The condition in the outer query then will be just
WHERE FIND_IN_SET(year, concatenated_years)
.UPD:
Here is the version that uses GROUP_CONCAT to pass the required data to the outer JOIN.
My comments start with
--newtover:
. By the way, 1) I do not think STRAIGHT_JOIN adds any benefit, and 2)COUNT(*)
has a special meaning in MySQL and should be used when you want to count rows.由于问题中的 SQL 基本上已挂起(现在仅显示相关部分),这是我的新答案
假设:条件确实相同,并且子查询和外部查询之间没有发生棘手的列别名
答案:
您可以删除外部查询中的 where 。
这应该会给你相同的结果。
(另请注意,您可以删除内部位置并保留外部位置 - 结果应该相同,但性能可能不同)。
最后,重复 where 子句可能不会对性能产生太大影响 - 与任何 I/O 相比,计算额外条件(甚至是 sqrt 等表达式)的成本非常低(并且这些条件不会影响性能)。对任何新列进行操作,因此所有 I/O 都已完成)
此外,您的内部查询和外部查询使用相同的 GROUP BY,并且外部查询从子查询获取所有数据。
这使得外部查询中的任何聚合函数变得多余(子查询中的行(作为外部查询的源)已按年份分组)。
这使得整个子选择变得多余。
As the SQL in the question was substantially hanged (now showing only relevant parts) here is my new answer
Assumption: The condition is really the same and no tricky column aliasing occurs between subquery and outer query
Answer:
You can remove the where in the outer query.
This should give you the same result.
(Also note that you could have removed the inner where and kept the outer one - results should be the same, however performance might not).
Finally, repeating the where clause probably does not have big impact on the performance - evaluating extra conditions (even expressions such as sqrt, etc) is very cheap compared to any I/O (and these conditions do not operate on any new columns, so all I/O had already been done)
Furthermore, your inner query and outer query use the same GROUP BY and the outer query gets it all data from subquery.
This makes any aggregate functions in the outer query redundant (the rows from the subquery, which are the source for the outer query, have already been grouped by year).
That makes the whole subselect redundant.
您可以在您的情况下使用临时表吗?尽管它仍然需要您使用 WHERE 子句两次,但它应该会显着提高您的性能。
希望这有帮助!
——配音
Are you able to use a temp table in your situation? Although it still requires you to use the WHERE clause twice, it should considerably boost your performance.
Hope this helps!
--Dubs