Inconsistent results from predict_proba
I'm working on creating a stroke prediction Flask API using logistic regression.
I've built my model in PyCharm, and my classifications based on predict_proba are well distributed when I run it normally (a fairly equal number of low/moderate/high risk classifications).
Here is a bit of my model for context:
df_features = {'age',
'hypertension',
'heart_disease',
'ever_married',
'Residence_type',
'avg_glucose_level',
'bmi',
'gender',
'work_type',
'smoking_status'}
df_target = ['stroke']
x = df[df_features]
y = df[df_target]
x_smote, y_smote = smote.fit_resample(x, y)
x_train, x_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size=0.33, random_state=42)
x_valid, x_test, y_valid, y_test = train_test_split(x_test, y_test, test_size=0.5, random_state=42)
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
md.fit(x_test, y_test.values.ravel())
pickle.dump(md, open('model.pkl', 'wb'))
model = pickle.load(open('model.pkl', 'rb'))
When trying to deploy to Flask, I encountered an issue where almost all my inputs were being classified as Low Risk.
Looking back at my model, I discovered that, for some reason, when I pass a set of values directly to the model, the results look very different from the previous results.
print(md.predict_proba(x_test))
will generate these predictions:
[[0.81931953 0.18068047]
[0.96735583 0.03264417]
[0.96280228 0.03719772]
...
[0.50304004 0.49695996]
[0.41301474 0.58698526]
[0.82213934 0.17786066]]
As for when I try passing a specific array,
print(md.predict_proba([[50, 1, 0, 0, 1, 105.32, 32.6, 0, 0, 1]]))
will generate results like this:
[1.00000000e+00 2.24494636e-16]
I tried running it on Google Colab as well, and the same command gives me proper results, so I'm stumped.
Can anyone explain why this is? In the API, I split the results and use the [0] value of predict_proba to classify the risk, but my results are all over the place.
I created a pipeline and ran the following test:
input_variables = pd.DataFrame([[50, 1, 0, 0, 1, 105.32, 32.6, 0, 0, 1]],
columns=headers,
dtype=float,
index=['input'])
prediction = pipe.predict(input_variables)
print("Prediction: ", prediction)
prediction_probab = pipe.predict_proba(input_variables)
print("Probabilities: ", prediction_probab)
And got this as a result:
Prediction: [0]
Probabilities: [[1.00000000e+00 3.64844539e-61]]
When sc is a StandardScaler and md is the classification model, the model is trained with standardized features. That's good practice. However, the model, as stored above, knows nothing about the scaler! It only knows that its features fall within a certain range.
Now, when you do
md.predict_proba([[50, 1, 0, 0, 1, 105.32, 32.6, 0, 0, 1]])
you pass raw (unstandardized) features to the model, which are far outside the standardized range the model has learned. Instead, you need to apply the scaler first, and you need to do that in the API too.
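A minimal, self-contained sketch of this failure mode, using synthetic single-feature data in place of the stroke dataset (the names sc and md mirror the question; the data and numbers are made up):

```python
# A model trained on standardized features gives extreme probabilities
# when fed raw, unscaled inputs. Scaling the input first fixes it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=20.0, size=(200, 1))   # raw feature, e.g. glucose
y = (x[:, 0] + rng.normal(0, 10, size=200) > 100).astype(int)

sc = StandardScaler()
x_scaled = sc.fit_transform(x)                  # fit the scaler on training data
md = LogisticRegression().fit(x_scaled, y)      # train on standardized features

raw = np.array([[105.32]])
p_raw = md.predict_proba(raw)                   # raw input: degenerate, ~[1, 0]
p_scaled = md.predict_proba(sc.transform(raw))  # scaled input: sensible probability
print(p_raw)
print(p_scaled)
```

The fix in the API is the same: call sc.transform(...) on the incoming request values before md.predict_proba, using the scaler that was fit on the training data.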
Solution 1: store both sc and md in a single pickle file, or in separate pickle files, then load and apply both in the API. The drawback is that the API becomes tightly coupled to your model structure (if you want to add another processing step, you'd need to change the API code).
Solution 2: put the scaler and the model into a Pipeline, which you can treat as a single model when storing, loading, and applying.
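Solution 2 might look like the following sketch, with synthetic data standing in for the stroke dataset (the pipeline step names are arbitrary; in-memory pickle.dumps/pickle.loads is used here, but file-based pickle.dump/pickle.load works the same way):

```python
# Bundle the scaler and classifier in a Pipeline so they are stored,
# loaded and applied as one object. The API then feeds raw features
# straight to the pipeline; it standardizes them internally.
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=20.0, size=(300, 3))              # stand-in features
y = (x.sum(axis=1) + rng.normal(0, 20, size=300) > 300).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.fit(x, y)

blob = pickle.dumps(pipe)          # one pickle carries both steps
model = pickle.loads(blob)

# Raw, unscaled features go straight in -- no sc.transform in the API.
probs = model.predict_proba([[105.0, 95.0, 110.0]])
print(probs)
```

Because the pipeline is the stored artifact, adding another preprocessing step later only changes the training script, not the API code.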