Evaluating multiple Isolation Forest estimators during GridSearchCV with a custom scorer function
I have a sample of values that don't have a y target. Actually, the X features (predictors) are all used to fit the Isolation Forest estimator. The goal is to identify which observations, both in this sample and in future ones, are actually outliers. So, for example, let's say that I fit an array of shape (340, 3) => (n_samples, n_features); I then predict on those samples to identify which of the 340 observations are outliers.
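To make that workflow concrete, here is a minimal, self-contained sketch on synthetic data (the random data and the parameter values are illustrative only, not the actual dataset):
import numpy as np
from sklearn.ensemble import IsolationForest

# illustrative data only: 340 observations, 3 features each
rng = np.random.RandomState(123)
X = rng.normal(size=(340, 3))

# fit_predict returns 1 for inliers and -1 for outliers
labels = IsolationForest(random_state=123).fit_predict(X)
print((labels == -1).sum(), "observations flagged as outliers")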
My approach so far is:
First, I create a pipeline object:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
steps=[('IsolationForest', IsolationForest(n_jobs=-1, random_state=123))]
pipeline=Pipeline(steps)
Then I create a parameter grid for hyperparameter tuning:
parameters_grid={'IsolationForest__n_estimators':[25,50,75],
                 'IsolationForest__max_samples':[0.25,0.5,0.75,1.0],
                 'IsolationForest__contamination':[0.01,0.05],
                 'IsolationForest__bootstrap':[True, False]
                 }
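As a quick sanity check, this grid expands to 3 × 4 × 2 × 2 = 48 candidate models, which matches the 48 mean_test_score entries shown further below:
from sklearn.model_selection import ParameterGrid

# number of hyperparameter combinations GridSearchCV will evaluate
print(len(ParameterGrid(parameters_grid)))  # 48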
Finally, I apply the GridSearchCV algorithm:
isolation_forest_grid=GridSearchCV(pipeline, parameters_grid, scoring=scorer_f, cv=3, verbose=2)
isolation_forest_grid.fit(scaled_x_features.values)
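Note that scaled_x_features is not defined in the snippets above; the .values attribute suggests a DataFrame of already-scaled features. A minimal sketch of how it might have been produced, where the StandardScaler step, the random data, and the column names are all assumptions:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# illustrative stand-in for the raw (340, 3) feature matrix
rng = np.random.RandomState(123)
x_features = pd.DataFrame(rng.normal(size=(340, 3)), columns=['f1', 'f2', 'f3'])

# hypothetical preparation: standardize each feature, keep the DataFrame shape
scaled_x_features = pd.DataFrame(StandardScaler().fit_transform(x_features),
                                 columns=x_features.columns)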
My goal is to identify the best fit for a scoring function (noted as scorer_f) that would efficiently select the most suitable Isolation Forest estimator for outlier detection.
So far, and based on this excellent answer, my scorer is as follows:
Scorer Function
import numpy as np

def scorer_f(estimator, X):
    # score_samples: the lower the score, the more anomalous the observation
    scores = estimator.score_samples(X)
    # flag every observation scoring below the 0.05 quantile as an outlier
    thresh = np.quantile(scores, 0.05)
    return len(np.where(scores < thresh)[0])
A brief explanation: I consistently flag the 5% of observations in each batch with the lowest scores (below the 0.05 quantile) as outliers. Thus, every score less than the threshold is denoted as an outlier. As a result, I instruct the GridSearch function to select the model that reports the most outliers, as a worst-case scenario.
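As a quick sanity check of the scorer (reusing the illustrative synthetic data from the first sketch, and assuming scorer_f as defined above):
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(123)
X = rng.normal(size=(340, 3))

# fit a single estimator and score it with the custom scorer
est = IsolationForest(random_state=123).fit(X)
# strictly below the 0.05 quantile of 340 scores is roughly 5%, i.e. about 17
print(scorer_f(est, X))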
To give you a taste of the results:
isolation_forest_grid.cv_results_['mean_test_score']
array([4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ,
4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. ,
4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. , 4. ,
4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ])
The GridSearch function selects the model at index 31 as the best model. As you can see, most of the model estimators report 4.0 outliers, so I expect the choice among those tied candidates to be made more or less arbitrarily.
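For what it's worth, the tie can be inspected directly on the fitted grid object; a small sketch, assuming isolation_forest_grid has been fitted as above:
import numpy as np

results = isolation_forest_grid.cv_results_
ties = np.flatnonzero(results['mean_test_score'] == results['mean_test_score'].max())
# best_index_ is the candidate GridSearchCV actually selected; comparing it
# with the tied indices shows how the tie was resolved in practice
print(isolation_forest_grid.best_index_, ties)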
Overall, I would like to ask whether this approach is valid (mathematically correct) and can produce valid model estimators for outlier detection. A drawback of outlier detection algorithms is their lack of a scorer metric in the sklearn.metrics library; that's why I struggled to find a good scoring metric for the GridSearchCV method.