Regression coefficients do not match conditional means
You can download the following data set from this repo.
|   | Y | CONST | T | X1 | X1T | X2 | X2T |
|---|---|---|---|---|---|---|---|
| 0 | 2.31252 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | -0.836074 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | -0.797183 | 1 | 0 | 0 | 0 | 1 | 0 |
I have a dependent variable (`Y`) and three binary columns (`T`, `X1` and `X2`). From this data we can create four groups:
- `X1 == 0` and `X2 == 0`
- `X1 == 0` and `X2 == 1`
- `X1 == 1` and `X2 == 0`
- `X1 == 1` and `X2 == 1`

Within each group, I want to calculate the difference in the mean of `Y` between observations with `T == 1` and `T == 0`.
I can do so with the following code:
# Libraries
import pandas as pd
# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()
# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']
# I want to replicate these differences with a regression
t['Teq1'] - t['Teq0']
> X1 X2
> 0 0 0.116175
> 1 0.168791
> 1 0 -0.027278
> 1 -0.147601
Problem
I want to recreate these results with the following regression model (`m`).
# Libraries
from statsmodels.api import OLS
# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()
# Estimated values
m.params[['T','X1T','X2T']]
> T 0.162198
> X1T -0.230372
> X2T -0.034303
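I was expecting the coefficients to reproduce the group differences as follows:
- `T` = 0.116175 (`X1 == 0`, `X2 == 0`)
- `T + X2T` = 0.168791 (`X1 == 0`, `X2 == 1`)
- `T + X1T` = -0.027278 (`X1 == 1`, `X2 == 0`)
- `T + X1T + X2T` = -0.147601 (`X1 == 1`, `X2 == 1`)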
Question
Why don't the regression coefficients match the output of the first code chunk (`t['Teq1'] - t['Teq0']`)?
Comments (1)
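Thanks to @Josef for noticing that `T`, `X1` and `X2` have eight different combinations while my regression model has six parameters. I was therefore missing two interaction terms (and thus two parameters). With only six parameters the model cannot fit all eight cell means exactly; for example, it forces the treatment effect to shift by the same `X2T` amount whether `X1` is 0 or 1, which the group differences above do not satisfy.

Namely, the regression model needs to account for the interaction between `X1` and `X2` as well as the three-way interaction between `X1`, `X2` and `T`. This can be done by declaring the missing interaction columns and fitting the model: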
Alternatively, we can use the formula interface:
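A sketch of the formula route, again on a synthetic stand-in for `df` (an assumption, since the question's data is not reproduced here). In the formula, `T * X1 * X2` expands to all main effects plus every two-way and three-way interaction, so no interaction columns need to be built by hand. The treatment effects are read off from predictions rather than from coefficient names, and the `grid` helper is introduced only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the question's df (assumption: same column layout)
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    'T': rng.integers(0, 2, n),
    'X1': rng.integers(0, 2, n),
    'X2': rng.integers(0, 2, n),
})
df['Y'] = rng.normal(size=n) + 0.3 * df['T'] * df['X1'] - 0.2 * df['X1'] * df['X2']

# Full factorial model: intercept, main effects and all interactions
m = smf.ols('Y ~ T * X1 * X2', data=df).fit()

# Predicted treatment effect in each (X1, X2) group: E[Y | T=1] - E[Y | T=0]
grid = pd.DataFrame(
    [(x1, x2) for x1 in (0, 1) for x2 in (0, 1)],
    columns=['X1', 'X2'],
)
effects = m.predict(grid.assign(T=1)) - m.predict(grid.assign(T=0))

print(grid.assign(effect=effects))
```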