Regression coefficients do not match conditional means

Posted on 2025-01-25 06:15:51

You can download the following data set from this repo.

           Y  CONST  T  X1  X1T  X2  X2T
0   2.31252       1  1   0    0   1    1
1  -0.836074      1  1   1    1   1    1
2  -0.797183      1  0   0    0   1    0

I have a dependent variable (Y) and three binary columns (T, X1 and X2). From this data we can create four groups:

  1. X1 == 0 and X2 == 0
  2. X1 == 0 and X2 == 1
  3. X1 == 1 and X2 == 0
  4. X1 == 1 and X2 == 1

Within each group, I want to calculate the difference in the mean of Y between observations with T == 1 and T == 0.

I can do so with the following code:

# Libraries
import pandas as pd

# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()

# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']

# I want to replicate these differences with a regression
t['Teq1'] - t['Teq0']

> X1  X2
> 0   0     0.116175
>     1     0.168791
> 1   0    -0.027278
>     1    -0.147601

Problem

I want to recreate these results with the following regression model (m).

# Libraries
from statsmodels.api import OLS

# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()

# Estimated values
m.params[['T','X1T','X2T']]

> T      0.162198
> X1T   -0.230372
> X2T   -0.034303

I was expecting the coefficients:

  1. T = 0.116175
  2. T + X2T = 0.168791
  3. T + X1T = -0.027278
  4. T + X1T + X2T = -0.147601

Question

Why don't the regression coefficients match the results from the first chunk's output (t['Teq1'] - t['Teq0'])?

Comments (1)

百变从容 2025-02-01 06:15:51

Thanks to @Josef for noticing that T, X1 and X2 have eight different combinations while my regression model has six parameters. I was therefore missing two interaction terms (and thus two parameters).

Namely, the regression model needs to account for the interaction between X1 and X2 as well as the interaction between X1, X2 and T.
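To make the counting argument concrete, here is a short derivation. The β symbols are notation introduced here for illustration (they do not appear in the original post), with each subscript naming the column the coefficient multiplies.

```latex
% Saturated model: eight parameters for the eight (T, X1, X2) cells
\mathbb{E}[Y \mid T, X_1, X_2]
  = \beta_0 + \beta_T T + \beta_1 X_1 + \beta_{1T} X_1 T
  + \beta_2 X_2 + \beta_{2T} X_2 T
  + \beta_{12} X_1 X_2 + \beta_{12T} X_1 X_2 T

% Difference in means between T = 1 and T = 0 within the cell (x_1, x_2)
\Delta(x_1, x_2) = \beta_T + \beta_{1T} x_1 + \beta_{2T} x_2 + \beta_{12T} x_1 x_2

% The six-parameter model imposes \beta_{12} = \beta_{12T} = 0, which forces
\Delta(1, 1) = \Delta(1, 0) + \Delta(0, 1) - \Delta(0, 0)
```

The group means in the question violate that restriction: -0.027278 + 0.168791 - 0.116175 = 0.025338, not -0.147601. The six-parameter fit therefore cannot reproduce all four differences at once, and least squares returns a compromise instead.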

This can be done by declaring the missing interaction columns and fitting the model:

# Declare missing columns
df = df.assign(X1X2 = df['X1'].multiply(df['X2']),
               X1X2T = df['X1'].multiply(df['X2T']))

# List of independent variables
cols = ['CONST','T','X1','X1T','X2','X2T','X1X2','X1X2T']

# Fit model (construct the OLS model first, then call .fit() on it)
m = OLS(endog=df['Y'], exog=df[cols]).fit()

Alternatively, we can use the formula interface:

# Declare formula
f = 'Y ~ T + X1 + I(X1*T) + X2 + I(X2*T) + I(X1*X2) + I(X1*X2*T)'

# Fit model
m = OLS.from_formula(formula=f, data=df).fit()
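As a sanity check, the sketch below compares the six-parameter and eight-parameter models against the group-wise differences in means. It runs on synthetic data, which is an assumption made for illustration and not the data set from the repo. It also uses `numpy.linalg.lstsq` in place of statsmodels to keep dependencies light; the point estimates are the same least-squares solution. The helper names `fit` and `implied_differences` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumed for illustration; NOT the asker's data set)
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "T": rng.integers(0, 2, n),
    "X1": rng.integers(0, 2, n),
    "X2": rng.integers(0, 2, n),
})
# Outcome with a genuine three-way interaction, plus noise
df["Y"] = (0.1 * df["T"] - 0.2 * df["X1"] * df["T"]
           + 0.3 * df["X1"] * df["X2"] * df["T"] + rng.normal(size=n))

# Constant and interaction columns, named as in the question and answer
df = df.assign(CONST=1.0,
               X1T=df["X1"] * df["T"],
               X2T=df["X2"] * df["T"],
               X1X2=df["X1"] * df["X2"],
               X1X2T=df["X1"] * df["X2"] * df["T"])

# Target: per-group difference in the mean of Y, T == 1 minus T == 0
means = df.groupby(["X1", "X2", "T"])["Y"].mean().unstack("T")
target = means[1] - means[0]

def fit(columns):
    """Least-squares coefficients for the given regressors, as a Series."""
    X = df[columns].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, df["Y"].to_numpy(), rcond=None)
    return pd.Series(beta, index=columns)

def implied_differences(b):
    """T == 1 minus T == 0 difference implied by coefficients b in each cell."""
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    values = [b["T"] + b["X1T"] * x1 + b["X2T"] * x2
              + b.get("X1X2T", 0.0) * x1 * x2
              for x1, x2 in cells]
    index = pd.MultiIndex.from_tuples(cells, names=["X1", "X2"])
    return pd.Series(values, index=index)

six_cols = ["CONST", "T", "X1", "X1T", "X2", "X2T"]
six = implied_differences(fit(six_cols))
saturated = implied_differences(fit(six_cols + ["X1X2", "X1X2T"]))

print(pd.DataFrame({"group means": target,
                    "six params": six,
                    "eight params": saturated}))
```

In the printed table, the eight-parameter column should agree with the group means up to floating-point error, while the six-parameter column should not. Note that the difference for the X1 == 1, X2 == 1 group is the sum of four coefficients (T, X1T, X2T and X1X2T), not three.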