Intercept is half of the actual value in logistic regression

Posted 2025-02-11 21:16:12


For a scientific study, I need to analyze a traditional logistic regression using Python and scikit-learn. After fitting my regression model with penalty="none", I get the correct coefficients, but the intercept is half of the real value. My code is mostly as follows:

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_excel("data.xlsx")
train, test = train_test_split(df, train_size=0.8, random_state=42)
train = train.drop(["Unnamed: 0"], axis=1)
test = test.drop(["Unnamed: 0"], axis=1)
x_train = train.drop(["GRUP"], axis=1)
x_train = sm.add_constant(x_train)  # adds a "const" column of 1's
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis=1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]
model = sm.Logit(y_train, x_train).fit()
model.summary()
log = LogisticRegression(penalty="none")
log.fit(x_train, y_train)
log.intercept_

With statsmodels I get the intercept (constant) 28.7140, but with scikit-learn I get 14.35698738. The other coefficients are the same. I verified it in SPSS, and the first one is the correct value. I don't want to have to use statsmodels just for logistic regression. Could you please help?

PS: Without the intercept, the model works fine.


Comments (1)

请止步禁区 2025-02-18 21:16:12


The issue here is that in the code you posted, you add a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train). Then you pass that same x_train object to sklearn's LogisticRegression(), where the default value of fit_intercept= is True. So at that stage you end up with two constant terms in the model, which makes the intercept non-identifiable: the solver splits the true intercept evenly between log.intercept_ and the coefficient on the "const" column, which is why you see exactly half the value (14.35698738 × 2 ≈ 28.7140).
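You can check this directly. A minimal sketch, reusing the log and x_train objects fitted in the question:

# Sanity check: the two "halves" of the intercept add back up to the
# statsmodels/SPSS value. "const" is the column added by sm.add_constant.
const_idx = list(x_train.columns).index("const")
total_intercept = log.intercept_[0] + log.coef_[0][const_idx]
print(total_intercept)  # ~28.7140, matching sm.Logit and SPSS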

So you should either set fit_intercept=False in the sklearn call, or leave fit_intercept=True but fit on an x_train array without the added constant column.
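For example, a minimal sketch of both options, assuming the same variables as in your code (note that on scikit-learn >= 1.2 you would pass penalty=None instead of the string "none"):

# Option 1: keep the design matrix with the "const" column and disable
# sklearn's own intercept; the intercept is then the coefficient
# estimated for the "const" column.
log = LogisticRegression(penalty="none", fit_intercept=False)
log.fit(x_train, y_train)
const_idx = list(x_train.columns).index("const")
print(log.coef_[0][const_idx])  # should match statsmodels' 28.7140

# Option 2: drop the added constant column and let sklearn fit the
# intercept itself.
log = LogisticRegression(penalty="none", fit_intercept=True)
log.fit(x_train.drop(columns=["const"]), y_train)
print(log.intercept_)  # should also match 28.7140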
