Regression coefficients don't match conditional means

Posted on 2025-01-25 06:15:51

You can download the following data set from this repo.

	Y	CONST	T	X1	X1T	X2	X2T
0	2.31252	1	1	0	0	1	1
1	-0.836074	1	1	1	1	1	1
2	-0.797183	1	0	0	0	1	0

I have a dependent variable (Y) and three binary columns (T, X1 and X2). From this data we can create four groups:

X1 == 0 and X2 == 0
X1 == 0 and X2 == 1
X1 == 1 and X2 == 0
X1 == 1 and X2 == 1

Within each group, I want to calculate the difference in the mean of Y between observations with T == 1 and T == 0.

I can do so with the following code:

# Libraries
import pandas as pd

# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()

# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']

# I want to replicate these differences with a regression
t['Teq1'] - t['Teq0']

> X1  X2
> 0   0     0.116175
>     1     0.168791
> 1   0    -0.027278
>     1    -0.147601

Problem

I want to recreate these results with the following regression model (m).

# Libraries
from statsmodels.api import OLS

# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()

# Estimated values
m.params[['T','X1T','X2T']]

> T      0.162198
> X1T   -0.230372
> X2T   -0.034303

I was expecting the coefficients:

T = 0.116175
T + X2T = 0.168791
T + X1T = -0.027278
T + X1T + X2T = -0.147601

Question

Why don't the regression coefficients match the results from the first chunk's output (t['Teq1'] - t['Teq0'])?


百变从容 2025-02-01 06:15:51

Thanks to @Josef for noticing that T, X1 and X2 have eight different combinations while my regression model has six parameters. I was therefore missing two interaction terms (and thus two parameters).

Namely, the regression model needs to account for the interaction between X1 and X2 as well as the interaction between X1, X2 and T.
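To make the parameter count concrete, here is a sketch of the fully interacted (saturated) model. The β symbols are notation introduced here for illustration and do not appear in the original post:

$$
\begin{aligned}
E[Y \mid T, X_1, X_2] ={}& \beta_0 + \beta_T T + \beta_1 X_1 + \beta_{1T} X_1 T + \beta_2 X_2 + \beta_{2T} X_2 T \\
&+ \beta_{12} X_1 X_2 + \beta_{12T} X_1 X_2 T
\end{aligned}
$$

Subtracting the T = 0 case from the T = 1 case gives the within-group difference:

$$
\Delta(X_1, X_2) = \beta_T + \beta_{1T} X_1 + \beta_{2T} X_2 + \beta_{12T} X_1 X_2
$$

$$
\begin{aligned}
\Delta(0,0) &= \beta_T \\
\Delta(0,1) &= \beta_T + \beta_{2T} \\
\Delta(1,0) &= \beta_T + \beta_{1T} \\
\Delta(1,1) &= \beta_T + \beta_{1T} + \beta_{2T} + \beta_{12T}
\end{aligned}
$$

The six-parameter model effectively forces β_12 = β_12T = 0, so it cannot fit all eight cell means freely, and its coefficients generally will not reproduce the four group differences.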

This can be done by declaring the missing interaction columns and fitting the model:

# Declare missing columns
df = df.assign(X1X2 = df['X1'].multiply(df['X2']),
               X1X2T = df['X1'].multiply(df['X2T']))

# List of independent variables
cols = ['CONST','T','X1','X1T','X2','X2T','X1X2','X1X2T']

# Fit model
m = OLS(endog=df['Y'], exog=df[cols]).fit()

Alternatively, we can use the formula interface:

# Declare formula
f = 'Y ~ T + X1 + I(X1*T) + X2 + I(X2*T) + I(X1*X2) + I(X1*X2*T)'

# Fit model
m = OLS.from_formula(formula=f, data=df).fit()
