Custom resampler with one-vs-one classification in a pipeline
I am working on a custom undersampler based on SVM. The class takes binary-class data and undersamples the majority class to the size of the minority class by selecting the majority examples nearest the class's support vector, up to the number of minority examples.
Here's the code:
import numpy as np
from collections import Counter
from sklearn.svm import SVC
from sklearn.utils import check_random_state

class NearSVUmdersampler():
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y):
        random_state = check_random_state(self.random_state)
        # class distribution
        counter = Counter(y)
        maj_class = counter.most_common()[0][0]
        min_class = counter.most_common()[-1][0]
        # number of minority examples
        num_minority = len(X[y == min_class])
        svc = SVC(kernel='rbf', random_state=32)
        svc.fit(X, y)
        # majority class support vectors
        # (note: this picks the single row of support_vectors_ at index maj_class)
        maj_sup_vector = svc.support_vectors_[maj_class]
        # compute distances to support vector points
        distances = []
        for i, x in enumerate(X[y == maj_class]):
            d = np.linalg.norm(maj_sup_vector - x)
            distances.append((i, d))
        # sort distances (ascending)
        distances.sort(key=lambda tup: tup[1])
        index = [i for i, d in distances][:num_minority]
        X_ds = np.concatenate((X[y == maj_class][index], X[y == min_class]))
        y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))
        return X_ds, y_ds
The resampled data returned by this class is balanced: the majority class is reduced to the size of the minority class.
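On the support-vector step: a minimal sketch of one way to collect all support vectors that belong to the majority class, using `svc.support_` (the indices of the support vectors in the training data) to look up their labels. The toy data, the variable names, and the choice of "distance to the nearest majority support vector" as the ranking criterion are assumptions for illustration, not necessarily what the class above is meant to do.

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary data (assumed for illustration): 60 majority vs 20 minority samples
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(2.0, 1.0, size=(20, 2))])
y = np.array([0] * 60 + [1] * 20)
maj_class = 0

svc = SVC(kernel="rbf").fit(X, y)

# svc.support_ indexes into X, so y[svc.support_] gives each support vector's label
sv_labels = y[svc.support_]
maj_sup_vectors = svc.support_vectors_[sv_labels == maj_class]

# Distance from every majority sample to its nearest majority support vector
X_maj = X[y == maj_class]
diffs = X_maj[:, None, :] - maj_sup_vectors[None, :, :]
nearest = np.linalg.norm(diffs, axis=2).min(axis=1)

# Majority samples ordered from closest to farthest
order = np.argsort(nearest)
```

With this criterion the majority support vectors themselves have distance zero, so they are always ranked first.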
I want to use this class in a pipeline for multiclass classification. My intention is to do this in a one-vs-one (OvO) scenario, so that in each OvO case the undersampling is invoked to resample the data for the two classes participating in that case.
So, for example, with this dummy data:
# sample data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=4,
                           weights=[0.08, 0.12, 0.2], flip_y=0, random_state=162)
xtrain, xtest, ytrain, ytest = train_test_split(X, y,
                                                test_size=.2, random_state=12)
Counter(ytrain)
Counter({0: 126, 1: 192, 2: 330, 3: 952})
Here I would have 4(4-1)/2 = 6 models in the OvO case. In each OvO model, the majority class should be undersampled like so:
Model 1 = Class 0 Vs Class 1 # maj:1=192; undersampled to 126 -> 0:126, 1:126
Model 2 = Class 0 Vs Class 2 # maj:2=330; undersampled to 126 -> 0:126, 2:126
Model 3 = Class 0 Vs Class 3 # maj:3=952; undersampled to 126 -> 0:126, 3:126
Model 4 = Class 1 Vs Class 2 # maj:2=330; undersampled to 192 -> 1:192, 2:192
Model 5 = Class 1 Vs Class 3 # maj:3=952; undersampled to 192 -> 1:192, 3:192
Model 6 = Class 2 Vs Class 3 # maj:3=952; undersampled to 330 -> 2:330, 3:330
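The table above can be reproduced with a few lines of plain Python. This sketch only takes the training counts quoted earlier and, for every pair of classes, uses the smaller count as the target size for both; the variable names are illustrative.

```python
from itertools import combinations

# Class counts in the training split, as reported by Counter(ytrain)
train_counts = {0: 126, 1: 192, 2: 330, 3: 952}

# Every unordered pair of classes is one OvO model: n(n-1)/2 pairs
pairs = list(combinations(sorted(train_counts), 2))

# In each pair, both classes should end up with the smaller of the two counts
expected = {(a, b): min(train_counts[a], train_counts[b]) for a, b in pairs}

for (a, b), n in expected.items():
    print(f"Class {a} Vs Class {b} -> {a}:{n}, {b}:{n}")
```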
With this in mind, I am interested in using SVC as the estimator for OneVsOneClassifier, as follows:
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier

model = OneVsOneClassifier(
    estimator=SVC(kernel='rbf'), n_jobs=-1)
resampler = NearSVUmdersampler(random_state=123)
And fit this as:
classifier = Pipeline([('sampler', resampler), ('clf', model) ])
classifier.fit(xtrain, ytrain)
Pipeline(steps=[('sampler',
<__main__.NearSVUmdersampler object at 0x7f4386fa30d0>),
('clf', OneVsOneClassifier(estimator=SVC(), n_jobs=-1))])
Problem:
It appears the resampler is invoked only once, and it is passed the full training data containing all classes. So it returns only the overall majority and minority classes of the original data, with the majority resampled to the size of the minority. As a result, the classifier is trained on only two classes.
In the above MWE, for instance, it returns:
{0: 126, 3: 126} # the majority: 3=952; undersampled to minority: 0=126
That is the case of Model 3, and nothing is done for the other cases.
How do I make this work in an OvO setting, given the pipeline I have?
Comments (1)
Try this: make the pipeline (the sampler followed by the SVC) the estimator of OneVsOneClassifier, instead of putting OneVsOneClassifier inside the pipeline.
Now when you call classifier.fit, the OneVsOneClassifier will fit your base_estimator pipeline for each slice of the data, thus resampling for each pair of classes.