Evaluating multiple Isolation Forest estimators during GridSearchCV with a custom scorer function


I have a sample of values with no y target. All of the X features (predictors) are used to fit the Isolation Forest estimator, and the goal is to identify which observations, both in this sample and in future batches, are actually outliers. So, for example, say I fit an array of shape (340, 3) => (n_samples, n_features) and then predict on it to identify which of the 340 observations are outliers.
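
For concreteness, here is a minimal sketch of that workflow on synthetic stand-in data (the array X below is a placeholder, not the real feature matrix):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(123)
X = rng.normal(size=(340, 3))       # stand-in for the real (340, 3) sample

iso = IsolationForest(random_state=123).fit(X)
labels = iso.predict(X)             # +1 = inlier, -1 = outlier
outlier_idx = np.where(labels == -1)[0]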

My approach so far is:

First, I create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

steps=[('IsolationForest', IsolationForest(n_jobs=-1, random_state=123))]
pipeline=Pipeline(steps)

Then I create a parameter grid for hyperparameter tuning

parameters_grid={'IsolationForest__n_estimators':[25,50,75],
                 'IsolationForest__max_samples':[0.25,0.5,0.75,1.0],
                 'IsolationForest__contamination':[0.01,0.05],
                 'IsolationForest__bootstrap':[True, False]
                }

Finally, I apply the GridSearchCV algorithm. Note that the grid defines 3 × 4 × 2 × 2 = 48 candidate configurations, which is why cv_results_ below contains 48 mean test scores.

isolation_forest_grid=GridSearchCV(pipeline, parameters_grid, scoring=scorer_f, cv=3, verbose=2)
isolation_forest_grid.fit(scaled_x_features.values)
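
Once the search has finished, the refit winner can be pulled out in the usual way (a small usage sketch; scaled_x_features is the same DataFrame used for fitting above):

print(isolation_forest_grid.best_params_)
best_model = isolation_forest_grid.best_estimator_       # the refit pipeline
labels = best_model.predict(scaled_x_features.values)    # +1 = inlier, -1 = outlier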

My goal is to find a scoring function (denoted scorer_f) that would efficiently select the most suitable Isolation Forest estimator for outlier detection.

So far, and based on this excellent answer, my scorer is as follows:

Scorer Function

import numpy as np

def scorer_f(estimator, X):
  # score each observation once; lower scores are more anomalous
  scores=estimator.score_samples(X)
  # threshold at the 0.05 quantile and count the scores below it
  thresh=np.quantile(scores, 0.05)
  return len(np.where(scores<thresh)[0])

A brief explanation: I always flag the bottom 5% (the 0.05 quantile) of observations in the batch as outliers, i.e. every score below that threshold is marked as an outlier. As a result, I instruct the grid search to select the model that flags the most outliers, as a worst-case scenario.
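
To see what the scorer actually returns, here is a small sanity check on synthetic data (the 120-row batch is an arbitrary stand-in, roughly the size of one cv=3 validation fold):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(120, 3))   # stand-in validation fold
est = IsolationForest(random_state=0).fit(X)
print(scorer_f(est, X))   # ~6, i.e. about 0.05 * 120 observations flagged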

To give you a taste of the results:

isolation_forest_grid.cv_results_['mean_test_score']

array([4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. , 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ])

The grid search selects the model at index 31 as the best model. As you can see, most of the estimators have a mean test score of 4.0 outliers, so I expect the rest of the selection is effectively random.
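
For reference, the chosen candidate and its parameters can be inspected directly (a small usage sketch):

i = isolation_forest_grid.best_index_                    # 31 in this run
print(isolation_forest_grid.cv_results_['params'][i])    # its hyperparameters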

Overall, I would like to ask whether this approach is valid (mathematically correct) and can produce valid model estimators for outlier detection. A drawback of outlier detection algorithms is that they lack a scorer metric in the sklearn.metrics library, which is why I have struggled to find a good scoring metric for the GridSearchCV method.
