最优二变量线性回归计算
问题
我希望应用y = mx + b
方程(其中m是SLOPE
,b是INTERCEPT
)到一个数据集,如 SQL 代码所示检索该数据集。 (MySQL) 查询的值是:
SLOPE = 0.0276653965651912
INTERCEPT = -57.2338357550468
SQL 代码
SELECT
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
FROM
(SELECT
D.AMOUNT,
Y.YEAR
FROM
CITY C, STATION S, YEAR_REF Y, MONTH_REF M, DAILY D
WHERE
-- For a specific city ...
--
C.ID = 8590 AND
-- Find all the stations within a 15 unit radius ...
--
SQRT( POW( C.LATITUDE - S.LATITUDE, 2 ) + POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND
-- Gather all known years for that station ...
--
S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '001' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY Y.YEAR
ORDER BY Y.YEAR
) t
问题
以下结果(计算线的起点和终点)显示不正确。为什么结果相差约 10 度(例如,异常值使数据发生偏差)?
(1900 * 0.0276653965651912) + (-57.2338357550468) = -4.66958228
(2009 * 0.0276653965651912) + (-57.2338357550468) = -1.65405406
(请注意,数据不再与图像匹配;代码。)
我预计 1900 年的结果约为 10(不是 -4.67),2009 年的结果约为 11.50(不是 -1.65) )。
相关网站
Problem
Am looking to apply the y = mx + b
equation (where m is SLOPE
, b is INTERCEPT
) to a data set, which is retrieved as shown in the SQL code. The values from the (MySQL) query are:
SLOPE = 0.0276653965651912
INTERCEPT = -57.2338357550468
SQL Code
SELECT
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
FROM
(SELECT
D.AMOUNT,
Y.YEAR
FROM
CITY C, STATION S, YEAR_REF Y, MONTH_REF M, DAILY D
WHERE
-- For a specific city ...
--
C.ID = 8590 AND
-- Find all the stations within a 15 unit radius ...
--
SQRT( POW( C.LATITUDE - S.LATITUDE, 2 ) + POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND
-- Gather all known years for that station ...
--
S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '001' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY Y.YEAR
ORDER BY Y.YEAR
) t
Question
The following results (to calculate the start and end points of the line) appear incorrect. Why are the results off by ~10 degrees (e.g., outliers skewing the data)?
(1900 * 0.0276653965651912) +
(-57.2338357550468) = -4.66958228(2009 * 0.0276653965651912) +
(-57.2338357550468) = -1.65405406
(Note that the data no longer match the image; the code.)
I would have expected the 1900 result to be around 10 (not -4.67) and the 2009 result to be around 11.50 (not -1.65).
Related Sites
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
尝试拆分函数,您错误计算了参数。请查看此处以供参考。
我会做类似下面的事情(请原谅我不太记得 SQL 语法和临时变量,所以代码实际上可能是错误的):
Try to split up the function, you have miscalculated the parameters. Have a look here for reference.
I would do something like the following (please excuse the fact that I don't remember much about SQL syntax and temporary variables, so the code might actually be wrong):
现在已验证其正确性:
有关斜率、截距和(皮尔逊)相关性的详细信息,请参阅图像。
This has now been verified as correct:
See the image for details on slope, intercept, and (Pearson's) correlation.