The intercept is half of the actual value in logistic regression
For a scientific study, I need to run a traditional logistic regression using Python and scikit-learn. After fitting my regression model with "penalty='none'", I get the correct coefficients, but the intercept is half of the real value. My code is mostly as follows:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_excel("data.xlsx")
train, test = train_test_split(df, train_size=0.8, random_state=42)
# drop the index column left over from the Excel export
train = train.drop(["Unnamed: 0"], axis=1)
test = test.drop(["Unnamed: 0"], axis=1)
x_train = train.drop(["GRUP"], axis=1)
x_train = sm.add_constant(x_train)  # prepends a 'const' column of 1's
y_train = train["GRUP"]
x_test = test.drop(["GRUP"], axis=1)
x_test = sm.add_constant(x_test)
y_test = test["GRUP"]
model = sm.Logit(y_train, x_train).fit()
model.summary()
log = LogisticRegression(penalty="none")
log.fit(x_train, y_train)
log.intercept_
With statsmodels I get the intercept (constant) "28.7140", but with scikit-learn I get "14.35698738". The other coefficients are the same. I verified it in SPSS, and the first one is the correct value. I don't want to use statsmodels just for logistic regression. Could you please help?
PS: Without the intercept, the model works fine.
1 Answer
The issue here is that, in the code you posted, you add a constant term (a column of 1's) to x_train with x_train = sm.add_constant(x_train). Then you pass that same x_train object to sklearn's LogisticRegression(), whose fit_intercept= parameter defaults to True. At that stage you end up with two constant terms, which causes the discrepancy in your estimated coefficients: with no penalty, only the sum of the fitted intercept and the coefficient on the duplicated 'const' column is identified, and the solver splits that sum evenly between the two identical terms, which is why you see exactly half of the true value. So, you should either turn off fit_intercept= in the sklearn code, or leave fit_intercept=True but use an x_train array without the added constant term.
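For reference, here is a minimal sketch of both fixes, reusing the x_train and y_train from the question (note: on newer scikit-learn releases the string penalty="none" has been replaced by penalty=None, so adjust to your installed version):

# Option 1: keep the 'const' column and suppress sklearn's own intercept
log = LogisticRegression(penalty="none", fit_intercept=False)
log.fit(x_train, y_train)
intercept = log.coef_[0][0]  # sm.add_constant() prepends 'const' as the first column

# Option 2: let sklearn fit the intercept and drop the added 'const' column
log = LogisticRegression(penalty="none")
log.fit(x_train.drop(columns="const"), y_train)
intercept = log.intercept_[0]

Either way, the reported intercept should now match the statsmodels/SPSS value (28.7140). You can also confirm the split in your original fit: log.intercept_[0] plus the fitted coefficient on 'const' should add up to that same number.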