How do I calculate r-squared using Python and Numpy?


I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (coefficient of correlation) and r-squared (coefficient of determination). I am comparing my results with Excel's best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best-fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here's my function:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)
    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    correlation = numpy.corrcoef(x, y)[0, 1]

    # r
    results['correlation'] = correlation
    # r-squared
    results['determination'] = correlation**2

    return results

巴黎盛开的樱花 2024-07-29 04:36:00

A very late reply, but just in case someone needs a ready function for this:

scipy.stats.linregress

i.e.

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)

as in @Adam Marples's answer.
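
For completeness, a minimal self-contained sketch of this approach (the x and y values below are made up for illustration):

import numpy as np
from scipy import stats

# made-up sample data: a roughly linear relationship
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(r_value**2)  # r-squared for the linear fit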

旧伤慢歌 2024-07-29 04:36:00

From yanl (yet another library), sklearn.metrics has an r2_score function:

from sklearn.metrics import r2_score

coefficient_of_determination = r2_score(y, p(x))  # p is the fitted polynomial, e.g. a numpy.poly1d
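
Since p is assumed to be an already-fitted polynomial, here is a minimal runnable sketch that builds one with numpy.polyfit first (the data is made up for illustration; any degree works with r2_score):

import numpy as np
from sklearn.metrics import r2_score

# made-up sample data with a quadratic trend
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0.1, 0.9, 4.2, 8.8, 16.1, 25.3])

# fit a 2nd-degree polynomial and evaluate it at x
p = np.poly1d(np.polyfit(x, y, 2))
print(r2_score(y, p(x)))  # coefficient of determination for the quadratic fit
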
烟花肆意 2024-07-29 04:36:00

From the numpy.polyfit documentation, polyfit is fitting a linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

E(y|x) = p_d * x**d + p_{d-1} * x **(d-1) + ... + p_1 * x + p_0

So you just need to calculate the R-squared for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST

Where I use 'y_bar' for the mean of the y's, and 'y_ihat' to be the fit value for each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared
    p = numpy.poly1d(coeffs)
    # fitted values and mean (x and y should be numpy arrays;
    # for plain lists, use the alternatives in the trailing comments)
    yhat = p(x)                         # or [p(z) for z in x]
    ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
    ssreg = numpy.sum((yhat-ybar)**2)   # or sum([(yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar)**2)    # or sum([(yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results
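
A quick usage sketch of the function above (the data is made up; for this nearly quadratic data, 'determination' should come out close to 1):

import numpy

x = numpy.array([1, 2, 3, 4, 5, 6], dtype=float)
y = numpy.array([1.1, 3.9, 9.2, 15.8, 25.1, 36.3])  # roughly x**2, made up

print(polyfit(x, y, 2))  # {'polynomial': [...], 'determination': ...}
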
壹場煙雨 2024-07-29 04:36:00

I have been using this successfully, where x and y are array-like.

Note: for linear regression only

import scipy.stats

def rsquared(x, y):
    """Return R^2 where x and y are array-like."""
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
    return r_value**2
夜唯美灬不弃 2024-07-29 04:36:00

I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.


statsmodels has the capability to calculate the r^2 of a polynomial fit directly; here are two methods...

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Construct the columns for the different powers of x
def get_r2_statsmodels(x, y, k=1):
    xpoly = np.column_stack([x**i for i in range(k+1)])    
    return sm.OLS(y, xpoly).fit().rsquared

# Use the formula API and construct a formula describing the polynomial
def get_r2_statsmodels_formula(x, y, k=1):
    formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
    data = {'x': x, 'y': y}
    return smf.ols(formula, data).fit().rsquared # or rsquared_adj

To further take advantage of statsmodels, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in a Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.

model = sm.OLS(y, xpoly)
results = model.fit()
results.summary()
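
A usage sketch for the two helpers above (the quadratic data here is made up):

import numpy as np

# made-up quadratic data with a little noise
x = np.linspace(0, 10, 50)
y = 2*x**2 - 3*x + 1 + np.random.randn(50)

print(get_r2_statsmodels(x, y, k=2))          # r^2 from the design-matrix version
print(get_r2_statsmodels_formula(x, y, k=2))  # r^2 from the formula version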

Below is my original Answer where I benchmarked various linear regression r^2 methods...

The corrcoef function used in the Question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r.

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2

These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:

  • Pure Python (direct r calculation)
    • 1000 loops, best of 3: 1.59 ms per loop
  • Numpy polyfit (applicable to n-th degree polynomial fits)
    • 1000 loops, best of 3: 326 µs per loop
  • Numpy Manual (direct r calculation)
    • 10000 loops, best of 3: 62.1 µs per loop
  • Numpy corrcoef (direct r calculation)
    • 10000 loops, best of 3: 56.6 µs per loop
  • Scipy (linear regression with r as an output)
    • 1000 loops, best of 3: 676 µs per loop
  • Statsmodels (can do n-th degree polynomial and many other fits)
    • 1000 loops, best of 3: 422 µs per loop

The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than scipy.stats.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure Python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.

import numpy as np
from scipy import stats
import statsmodels.api as sm
import math

n = 1000
x = np.random.rand(n)*10
x.sort()
y = 10 * x + (5 + np.random.randn(n)*10 - 5)  # the +5 and -5 cancel; this is just scaled Gaussian noise

x_list = list(x)
y_list = list(y)

def get_r2_numpy(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
    return r_squared
    
def get_r2_scipy(x, y):
    _, _, r_value, _, _ = stats.linregress(x, y)
    return r_value**2
    
def get_r2_statsmodels(x, y):
    return sm.OLS(y, sm.add_constant(x)).fit().rsquared
    
def get_r2_python(x_list, y_list):
    n = len(x_list)
    x_bar = sum(x_list)/n
    y_bar = sum(y_list)/n
    x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
    y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
    zx = [(xi-x_bar)/x_std for xi in x_list]
    zy = [(yi-y_bar)/y_std for yi in y_list]
    r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
    return r**2
    
def get_r2_numpy_manual(x, y):
    zx = (x-np.mean(x))/np.std(x, ddof=1)
    zy = (y-np.mean(y))/np.std(y, ddof=1)
    r = np.sum(zx*zy)/(len(x)-1)
    return r**2
    
def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2
    
print('Python')
%timeit get_r2_python(x_list, y_list)
print('Numpy polyfit')
%timeit get_r2_numpy(x, y)
print('Numpy Manual')
%timeit get_r2_numpy_manual(x, y)
print('Numpy corrcoef')
%timeit get_r2_numpy_corrcoef(x, y)
print('Scipy')
%timeit get_r2_scipy(x, y)
print('Statsmodels')
%timeit get_r2_statsmodels(x, y)

7/28/21 Benchmark results. (Python 3.7, numpy 1.19, scipy 1.6, statsmodels 0.12)

Python
2.41 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy polyfit
318 µs ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy Manual
79.3 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy corrcoef
83.8 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Scipy
221 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Statsmodels
375 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
逆蝶 2024-07-29 04:36:00

Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):

from __future__ import division 
import numpy as np

def compute_r2_weighted(y_true, y_pred, weight):
    sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    tse = (weight * (y_true - np.average(
        y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse

Example:

from __future__ import print_function, division
import numpy as np
import sklearn.metrics

def compute_r2_weighted(y_true, y_pred, weight):
    sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    tse = (weight * (y_true - np.average(
        y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse    

def compute_r2(y_true, y_predicted):
    sse = sum((y_true - y_predicted)**2)
    tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse

def main():
    '''
    Demonstrate the use of compute_r2_weighted() and check the results against sklearn
    '''        
    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    weight = [1, 5, 1, 2]
    r2_score = sklearn.metrics.r2_score(y_true, y_pred)
    print('r2_score: {0}'.format(r2_score))  
    r2_score,_,_ = compute_r2(np.array(y_true), np.array(y_pred))
    print('r2_score: {0}'.format(r2_score))
    r2_score = sklearn.metrics.r2_score(y_true, y_pred, sample_weight=weight)
    print('r2_score weighted: {0}'.format(r2_score))
    r2_score,_,_ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
    print('r2_score weighted: {0}'.format(r2_score))

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

Outputs:

r2_score: 0.9486081370449679
r2_score: 0.9486081370449679
r2_score weighted: 0.9573170731707317
r2_score weighted: 0.9573170731707317

This corresponds to the formula (mirror):

R^2 = 1 - SSE/SST,  with  SSE = Sum(i=1..n) w_i*(y_i - f_i)^2  and  SST = Sum(i=1..n) w_i*(y_i - y_av)^2

where f_i is the predicted value from the fit, y_{av} is the (weighted) mean of the observed data, and y_i is the observed data value. w_i is the weighting applied to each data point, usually w_i=1. SSE is the sum of squares due to error and SST is the total sum of squares.


If interested, the code in R: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f (mirror)

无戏配角 2024-07-29 04:36:00

Here's a very simple Python function to compute R^2 from the actual and predicted values, assuming y and y_hat are pandas Series:

def r_squared(y, y_hat):
    y_bar = y.mean()
    ss_tot = ((y-y_bar)**2).sum()
    ss_res = ((y-y_hat)**2).sum()
    return 1 - (ss_res/ss_tot)
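
A usage sketch (made-up values, assuming pandas is available):

import pandas as pd

y = pd.Series([3.0, -0.5, 2.0, 7.0])
y_hat = pd.Series([2.5, 0.0, 2.0, 8.0])
print(r_squared(y, y_hat))
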
丶视觉 2024-07-29 04:36:00

R-squared is a statistic that only applies to linear regression.

Essentially, it measures how much variation in your data can be explained by the linear regression.

So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean...

SST = Sum(i=1..n) (y_i - y_bar)^2

where y_bar is the mean of the y's.

Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean

SSReg = Sum(i=1..n) (y_hat_i - y_bar)^2

and find the ratio of those two.

Now, all you would have to do for a polynomial fit is plug in the y_hat's from that model, but it's not accurate to call that r-squared.
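
A sketch of that plug-in computation (the data is made up):

import numpy as np

# fit a quadratic, then form SSReg/SST as described above
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.2, 1.1, 3.9, 9.2, 15.8])

y_hat = np.poly1d(np.polyfit(x, y, 2))(x)
ss_reg = np.sum((y_hat - y.mean())**2)
ss_tot = np.sum((y - y.mean())**2)
print(ss_reg / ss_tot)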

Here is a link I found that speaks to it a little.

猥琐帝 2024-07-29 04:36:00

The Wikipedia article on r-squareds suggests that it may be used for general model fitting rather than just linear regression.

深海少女心 2024-07-29 04:36:00

Using the numpy module (tested in Python 3):

import numpy as np

def linear_regression(x, y):
    # np.polynomial.polynomial.polyfit returns coefficients in
    # *increasing* degree order: [intercept, slope] for degree 1
    coefs = np.polynomial.polynomial.polyfit(x, y, 1)
    ffit = np.poly1d(coefs)
    # poly1d reads its input highest-degree-first, so with the order
    # flipped, indexing happens to recover slope and intercept here
    m = ffit[0]
    b = ffit[1]
    eq = 'y = {}x + {}'.format(round(m, 3), round(b, 3))
    rsquared = np.corrcoef(x, y)[0, 1]**2
    return rsquared, eq, m, b

rsquared, eq, m, b = linear_regression(x, y)  # x and y are your data arrays
print(rsquared, m, b)
print(eq)

Output:

0.013378252355751777 0.1316331351105754 0.7928782850418713 
y = 0.132x + 0.793

Note: r² ≠ R²
r² is called the "Coefficient of Determination"
R² is the square of the Pearson Coefficient

R², officially conflated as r², is probably the one you want, as it's a least-square fit, which is better than the simple fraction of sums that r² is. Numpy is not afraid to call it "corrcoef", which presupposes Pearson is the de-facto correlation coefficient.

a√萤火虫的光℡ 2024-07-29 04:36:00

You can execute this code directly; it will find the polynomial and the R-value for you. Put a comment down below if you need more explanation.

from scipy.stats import linregress
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3, 5, 6, 7, 8])

p3 = np.polyfit(x, y, 3)  # 3rd-degree polynomial; change to any degree you want
xp = np.linspace(1, 6, 6)  # 6 evaluation points spanning the x range
poly_arr = np.polyval(p3, xp)

poly_list = [round(num, 3) for num in list(poly_arr)]
# note: this correlates x with the *fitted* values, not the fit with y
slope, intercept, r_value, p_value, std_err = linregress(x, poly_list)
print(r_value**2)
热风软妹 2024-07-29 04:36:00

From the scipy.stats.linregress source. They use the average sum of squares method.

import numpy as np

# x and y are your data arrays
x = np.array(x)
y = np.array(y)

# average sums of squares from the covariance matrix:
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat

r_num = ssxym
r_den = np.sqrt(ssxm * ssym)

if r_den == 0.0:
    r = 0.0
else:
    r = r_num / r_den
    # clamp rounding error into the valid range [-1, 1]
    if r > 1.0:
        r = 1.0
    elif r < -1.0:
        r = -1.0
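
Wrapped into a small self-contained function for convenience (a sketch; the sample data is made up):

import numpy as np

def pearson_r(x, y):
    # covariance-matrix route, as in the scipy source above
    ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
    r_den = np.sqrt(ssxm * ssym)
    if r_den == 0.0:
        return 0.0
    # clamp rounding error into the valid range [-1, 1]
    return max(-1.0, min(1.0, ssxym / r_den))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.9, 4.1, 6.0, 8.2])
print(pearson_r(x, y)**2)  # r-squared
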
薔薇婲 2024-07-29 04:36:00

You can do it directly in numpy==1.26 without any extra libraries by passing full=True:

np.polyfit(xs, ys, 1, full=True)

Example:

import numpy as np

# make 5 (x, y) points with fake variation
xs = np.linspace(0, 1, 5)
ys = np.array([0, 0.22, 0.55, 0.70, 1])

# ss_r: residuals – sum of squared residuals of the least squares fit.
#       ss_r will be 1D array if `ys` is 1D.
(m, b), ss_r, _, _, _ = np.polyfit(xs, ys, 1, full=True)
m, b, ss_r

(0.9920000000000001, -0.001999999999999821, array([0.00568]))

# evaluate the fitted line y = mx + b, then a total sum of squares
# (note: computed from the fitted values here; the conventional SST
# uses the observed ys instead)
r_ys = xs*m + b
ss_tot = np.sum((r_ys - np.mean(r_ys)) ** 2)

r_squared = 1 - (ss_r / ss_tot)
r_squared

array([0.99630472])
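
The same full=True trick extends to higher degrees, which is what the original question asks about. A sketch with made-up data, computing SST from the observed ys (the conventional definition):

import numpy as np

# made-up data with a quadratic trend
xs = np.linspace(0, 1, 8)
ys = 3*xs**2 - xs + 0.5 + 0.01*np.random.randn(8)

# degree-2 fit; ss_res holds the sum of squared residuals
coeffs, ss_res, _, _, _ = np.polyfit(xs, ys, 2, full=True)

# conventional total sum of squares, from the observed ys
ss_tot = np.sum((ys - np.mean(ys))**2)
print(1 - ss_res[0] / ss_tot)  # r-squared for the quadratic fit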
