Predicting values using polynomial / least-squares regression
I have a dataset with 2 input variables (called x, with shape n x 2 holding the x1 and x2 values) and 1 output (called y). I am having trouble understanding how to calculate predicted output values from the polynomial features and the weights. My understanding is that y = X dot w, where X is the polynomial feature matrix and w is the weight vector.
The polynomial features were generated using PolynomialFeatures from sklearn.preprocessing, and the weights were generated with np.linalg.lstsq. Below is sample code I created for this.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame()
df['x1'] = [1,2,3,4,5]
df['x2'] = [11,12,13,14,15]
df['y'] = [75,96,136,170,211]
x = np.array([df['x1'],df['x2']]).T
y = np.array(df['y']).reshape(-1,1)
poly = PolynomialFeatures(interaction_only=False, include_bias=True)
poly_features = poly.fit_transform(x)
print(poly_features)
w = np.linalg.lstsq(x,y)
weight_list = []
for item in w:
    if type(item) is np.int32:
        weight_list.append(item)
        continue
    for weight in item:
        if type(weight) is np.ndarray:
            weight_list.append(weight[0])
            continue
        weight_list.append(weight)
weight_list
y_pred = np.dot(poly_features, weight_list)
print(y_pred)
regression_model = LinearRegression()
regression_model.fit(x,y)
y_predicted = regression_model.predict(x)
print(y_predicted)
The y_pred values are nowhere near the y values I provided. Am I using the wrong inputs for np.linalg.lstsq, or is there a lapse in my understanding?
Using the built-in LinearRegression() function, y_predicted is much closer to my provided y values; y_pred is orders of magnitude higher.
1 Answer
In the lstsq call, the generated polynomial features should be the first input, not the raw x data that was originally supplied.
Additionally, the first element of the tuple returned by lstsq is the array of regression coefficients/weights, which can be accessed by indexing with 0.
The corrected code, using this explicit linear-algebra approach to the least-squares weights/coefficients, would be:
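A minimal sketch of that fix, reconstructed from the answer's description (the original snippet is not preserved on the page; variable names follow the question's code):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1, 11], [2, 12], [3, 13], [4, 14], [5, 15]])
y = np.array([75, 96, 136, 170, 211]).reshape(-1, 1)

# Degree-2 polynomial features: [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(interaction_only=False, include_bias=True)
poly_features = poly.fit_transform(x)

# Fit against the polynomial features (not the raw x), and take element 0
# of the returned tuple: the coefficient vector itself.
w = np.linalg.lstsq(poly_features, y, rcond=None)[0]

# Predictions are then simply X dot w.
y_pred = poly_features @ w
```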
The entire corrected code (note that on this data the explicit polynomial fit actually tracks the provided y values more closely than the default LinearRegression fit, since the latter uses only the raw linear features):
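A reconstruction of the full listing, again assuming the fix the answer describes (corrected lstsq call plus the question's LinearRegression comparison; the exact original listing is missing from the page):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame()
df['x1'] = [1, 2, 3, 4, 5]
df['x2'] = [11, 12, 13, 14, 15]
df['y'] = [75, 96, 136, 170, 211]

x = np.array([df['x1'], df['x2']]).T
y = np.array(df['y']).reshape(-1, 1)

poly = PolynomialFeatures(interaction_only=False, include_bias=True)
poly_features = poly.fit_transform(x)

# Corrected: pass the polynomial features to lstsq and keep only the
# first returned element (the weight vector).
w = np.linalg.lstsq(poly_features, y, rcond=None)[0]
y_pred = np.dot(poly_features, w)
print(y_pred)

# Comparison: LinearRegression on the raw features fits only a linear
# model in x1/x2, so its predictions track this y less closely.
regression_model = LinearRegression()
regression_model.fit(x, y)
y_predicted = regression_model.predict(x)
print(y_predicted)
```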