Evaluating multiple Isolation Forest estimators during GridSearchCV with a custom scorer function


I have a sample of values with no y target. All of the X features (predictors) are used to fit the Isolation Forest estimator, and the goal is to identify which observations, both in this sample and in future batches, are actually outliers. So, for example, say I fit an array of shape (340, 3) => (n_samples, n_features) and then predict on it to identify which of the 340 observations are outliers.
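
For concreteness, here is a minimal sketch of that workflow on synthetic stand-in data (the array X below is a placeholder, not the real feature matrix):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(123)
X = rng.normal(size=(340, 3))       # stand-in for the real (340, 3) sample

iso = IsolationForest(random_state=123).fit(X)
labels = iso.predict(X)             # +1 = inlier, -1 = outlier
outlier_idx = np.where(labels == -1)[0]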

My approach so far is:

First, I create a pipeline object

from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

steps=[('IsolationForest', IsolationForest(n_jobs=-1, random_state=123))]
pipeline=Pipeline(steps)

Then I create a parameter grid for hyperparameter tuning

parameters_grid={'IsolationForest__n_estimators':[25,50,75],
                 'IsolationForest__max_samples':[0.25,0.5,0.75,1.0],
                 'IsolationForest__contamination':[0.01,0.05],
                 'IsolationForest__bootstrap':[True, False]
                }

Finally, I apply the GridSearchCV algorithm. Note that the grid defines 3 × 4 × 2 × 2 = 48 candidate configurations, which is why cv_results_ below contains 48 mean test scores.

isolation_forest_grid=GridSearchCV(pipeline, parameters_grid, scoring=scorer_f, cv=3, verbose=2)
isolation_forest_grid.fit(scaled_x_features.values)
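
Once the search has finished, the refit winner can be pulled out in the usual way (a small usage sketch; scaled_x_features is the same DataFrame used for fitting above):

print(isolation_forest_grid.best_params_)
best_model = isolation_forest_grid.best_estimator_       # the refit pipeline
labels = best_model.predict(scaled_x_features.values)    # +1 = inlier, -1 = outlier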

My goal is to find a scoring function (denoted scorer_f) that would efficiently select the most suitable Isolation Forest estimator for outlier detection.

So far, and based on this excellent answer, my scorer is as follows:

Scorer Function

import numpy as np

def scorer_f(estimator, X):
  # score each observation once; lower scores are more anomalous
  scores=estimator.score_samples(X)
  # threshold at the 0.05 quantile and count the scores below it
  thresh=np.quantile(scores, 0.05)
  return len(np.where(scores<thresh)[0])

A brief explanation: I always flag the bottom 5% (the 0.05 quantile) of observations in the batch as outliers, i.e. every score below that threshold is marked as an outlier. As a result, I instruct the grid search to select the model that flags the most outliers, as a worst-case scenario.
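
To see what the scorer actually returns, here is a small sanity check on synthetic data (the 120-row batch is an arbitrary stand-in, roughly the size of one cv=3 validation fold):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(120, 3))   # stand-in validation fold
est = IsolationForest(random_state=0).fit(X)
print(scorer_f(est, X))   # ~6, i.e. about 0.05 * 120 observations flagged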

To give you a taste of the results:

isolation_forest_grid.cv_results_['mean_test_score']

array([4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. , 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ])

The grid search selects the model at index 31 as the best model. As you can see, most of the estimators have a mean test score of 4.0 outliers, so I expect the rest of the selection is effectively random.
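
For reference, the chosen candidate and its parameters can be inspected directly (a small usage sketch):

i = isolation_forest_grid.best_index_                    # 31 in this run
print(isolation_forest_grid.cv_results_['params'][i])    # its hyperparameters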

Overall, I would like to ask whether this approach is valid (mathematically correct) and can produce valid model estimators for outlier detection. A drawback of outlier detection algorithms is that they lack a scorer metric in the sklearn.metrics library, which is why I have struggled to find a good scoring metric for the GridSearchCV method.
