Custom resampler for one-vs-one in a pipeline

Posted on 2025-02-10 20:47:19

I am working on a custom undersampler based on SVM. The class takes binary-class data and undersamples the majority class down to the size of the minority class, by selecting the majority examples nearest to that class's support vectors.

Here's the code:


import numpy as np
from collections import Counter

from sklearn.svm import SVC
from sklearn.utils import check_random_state

class NearSVUmdersampler():
  def __init__(self, random_state=None):
    self.random_state = random_state
  
  def fit_resample(self, X, y):
    random_state = check_random_state(self.random_state)
    # class distribution
    counter = Counter(y)
    maj_class = counter.most_common()[0][0]
    min_class = counter.most_common()[-1][0]
    # number of minority examples
    num_minority = len(X[ y == min_class])
    svc = SVC(kernel='rbf', random_state=32)
    svc.fit(X,y)
    # majority class support vectors (svc.support_ holds their row indices in X)
    maj_sup_vector = svc.support_vectors_[y[svc.support_] == maj_class]
    # distance from each majority example to its nearest majority support vector
    distances = []
    for i, x in enumerate(X[y == maj_class]):
      d = np.linalg.norm(maj_sup_vector - x, axis=1).min()
      distances.append((i, d))
    # sort distances (ascending)
    distances.sort(key=lambda tup: tup[1])
    index = [i for i, d in distances][:num_minority]
    X_ds = np.concatenate((X[y == maj_class][index], X[y == min_class]))
    y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))

    return X_ds, y_ds

The resampled data returned by this class is balanced: the majority class is reduced to the size of the minority class.
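The selection step can also be written without the Python loop. The sketch below assumes the intended rule is "keep the k majority points closest to any majority support vector" with Euclidean distance; `nearest_to_support_vectors` is a hypothetical helper name, and the arrays are toy values rather than the output of a fitted SVC.

```python
import numpy as np

def nearest_to_support_vectors(X_maj, support_vectors, k):
    """Return indices of the k rows of X_maj closest to any support vector.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    # Pairwise differences via broadcasting: shape (n_maj, n_sv, n_features)
    diffs = X_maj[:, None, :] - support_vectors[None, :, :]
    # Distance from each majority point to its nearest support vector
    dists = np.linalg.norm(diffs, axis=2).min(axis=1)
    # Stable sort keeps the original order among ties
    return np.argsort(dists, kind="stable")[:k]

# Toy data: four majority points and two support vectors
X_maj = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [2.0, 2.0]])
sv = np.array([[1.0, 1.0], [2.0, 2.5]])
selected = nearest_to_support_vectors(X_maj, sv, k=2)
```

Here `selected` indexes into `X_maj`, so `X_maj[selected]` would be the retained majority examples.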

I want to use this class in a pipeline for multiclass classification. My intention is to do this in a one-vs-one (OvO) scenario, so that for each OvO pair the undersampling is invoked to resample the data for the two participating classes.

So, for example, with this dummy data:

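from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# sample data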
X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, n_classes=4, weights=[0.08, 0.12, 0.2], flip_y=0, random_state=162)

xtrain, xtest, ytrain, ytest = train_test_split(X, y, 
                test_size=.2, random_state=12)

Counter(ytrain)
Counter({0: 126, 1: 192, 2: 330, 3: 952})

Here I would have 4(4-1)/2 = 6 models in the OvO case. In each OvO model, the majority class should be undersampled like so:

Model 1 = Class 0 Vs Class 1 # maj:1=192; undersampled to 126, -> 0:126, 1:126 
Model 2 = Class 0 Vs Class 2 # maj:2=330; undersampled to 126  -> 0:126, 2:126
Model 3 = Class 0 Vs Class 3 # maj:3=952; undersampled to 126, -> 0:126, 3:126
Model 4 = Class 1 Vs Class 2 # maj:2=330; undersampled to 192  -> 1:192, 2:192
Model 5 = Class 1 Vs Class 3 # maj:3=952; undersampled to 192  -> 1:192, 3:192
Model 6 = Class 2 Vs Class 3 # maj:3=952; undersampled to 330  -> 2:330, 3:330
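The pairings listed above can be reproduced from the class counts alone. This is a minimal sketch that only does the arithmetic (no resampling or model fitting); the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Training-set class counts reported above
counts = Counter({0: 126, 1: 192, 2: 330, 3: 952})

# k(k-1)/2 unordered pairs for k classes
pairs = list(combinations(sorted(counts), 2))

expected = {}
for a, b in pairs:
    # Both classes end up at the size of the smaller one
    target = min(counts[a], counts[b])
    expected[(a, b)] = {a: target, b: target}

for i, (pair, sizes) in enumerate(expected.items(), start=1):
    print(f"Model {i} = Class {pair[0]} vs Class {pair[1]} -> {sizes}")
```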

With this in mind, I am interested in using SVC as the estimator for OneVsOneClassifier, as follows:

from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier

model = OneVsOneClassifier(
    estimator=SVC(kernel='rbf'), n_jobs=-1)

resampler = NearSVUmdersampler(random_state=123)

And fit this as:

classifier = Pipeline([('sampler', resampler), ('clf', model) ])
classifier.fit(xtrain, ytrain)
Pipeline(steps=[('sampler',
                 <__main__.NearSVUmdersampler object at 0x7f4386fa30d0>),
                ('clf', OneVsOneClassifier(estimator=SVC(), n_jobs=-1))])

Problem:

It appears the resampler is invoked only once, on the full training data containing all classes. So it returns only the overall majority and minority classes of the original data, with the majority undersampled to the size of the minority, and the model ends up trained on only two classes.

In the above MWE, for instance, it returns:

{0: 126, 3: 126} # the majority: 3=952; undersampled to minority: 0=126

That is the case of Model 3; nothing is done for the other cases.
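The collapse to two classes follows from how `fit_resample` picks its classes. A minimal sketch using only the class counts (no model fitting), mirroring the `most_common()` lookups in the class above:

```python
from collections import Counter

counts = Counter({0: 126, 1: 192, 2: 330, 3: 952})

ranked = counts.most_common()   # sorted by count, largest first
maj_class = ranked[0][0]        # most frequent class overall
min_class = ranked[-1][0]       # least frequent class overall

# Only these two classes are concatenated into the output; the classes in
# between are never selected, so they drop out of the resampled data.
kept = {maj_class, min_class}
dropped = set(counts) - kept
resampled = {c: counts[min_class] for c in sorted(kept)}
```

With four classes in the input, classes 1 and 2 are neither the majority nor the minority, which is why they never reach the classifier.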

How do I make this work in OvO, given the pipeline I have?

樱娆 2025-02-17 20:47:19

Try this:

model = SVC(kernel='rbf')
resampler = NearSVUmdersampler(random_state=123)
base_estimator = Pipeline([('sampler', resampler), ('clf', model)])
classifier = OneVsOneClassifier(estimator=base_estimator)

Now when you call classifier.fit, the OneVsOneClassifier will fit your base_estimator pipeline on each slice of the data, thus resampling for each pair of classes.
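One way to check the per-pair behaviour is to hand OneVsOneClassifier a stand-in estimator that only records what it is fitted on. In this sketch `RecordingClassifier` and `FIT_LOG` are hypothetical names, the data is a toy array with the same class counts as the training set above, and `n_jobs` is left unset so every fit runs in the current process.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.multiclass import OneVsOneClassifier

FIT_LOG = []  # appended to by every fit call

class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in estimator that only records its training labels."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Record class sizes rather than labels, since the labels may be
        # re-encoded for each binary sub-problem.
        FIT_LOG.append(sorted(Counter(y.tolist()).values()))
        return self

# Toy labels with the same class counts as the training set above
y = np.repeat([0, 1, 2, 3], [126, 192, 330, 952])
X = np.zeros((len(y), 2))

OneVsOneClassifier(RecordingClassifier()).fit(X, y)
```

Each entry of `FIT_LOG` holds the two class sizes seen by one sub-estimator, so a sampler placed inside the base estimator receives exactly one pair of classes at a time.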
