How can I use WEKA's ADTree classifier as the base of a bagged scikit-learn model?

Posted 2025-01-18 02:05:20 · 3075 characters · 0 views · 0 comments


My intention is to recreate a large model built in WEKA using scikit-learn and other libraries.

I have this base model built with pyweka:

base_model_1 = Classifier(classname="weka.classifiers.trees.ADTree", 
                  options=["-B", "10", "-E", "-3", "-S", "1"])

base_model_1.build_classifier(train_model_1)
base_model_1

But when I try to use it as the base estimator like this:

model = BaggingClassifier(base_estimator=base_model_1, n_estimators=100, n_jobs=1, random_state=1)

and try to evaluate the model like this:

from statistics import mean

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
AUC_scores = cross_val_score(model, X_data_train, y_data_train, scoring='roc_auc', cv=cv, n_jobs=-1)
F1_scores = cross_val_score(model, X_data_train, y_data_train, scoring='f1', cv=cv, n_jobs=-1)
Precision_scores = cross_val_score(model, X_data_train, y_data_train, scoring='precision', cv=cv, n_jobs=-1)
Recall_scores = cross_val_score(model, X_data_train, y_data_train, scoring='recall', cv=cv, n_jobs=-1)
Accuracy_scores = cross_val_score(model, X_data_train, y_data_train, scoring='accuracy', cv=cv, n_jobs=-1)
print("-------------------------------------------------------")
print(AUC_scores)
print("-------------------------------------------------------")
print(F1_scores)
print("-------------------------------------------------------")
print(Precision_scores)
print("-------------------------------------------------------")
print(Recall_scores)
print("-------------------------------------------------------")
print(Accuracy_scores)
print("-------------------------------------------------------")
print('Mean ROC AUC: %.3f' % mean(AUC_scores))
print('Mean F1: %.3f' % mean(F1_scores))
print('Mean Precision: %.3f' % mean(Precision_scores))
print('Mean Recall: %.3f' % mean(Recall_scores))
print('Mean Accuracy: %.3f' % mean(Accuracy_scores))

I just receive NaN:


Distribución Variable Clase Desbalanceada
0    161
1     34
Name: Soft-Tissue_injury_≥4days, dtype: int64
-------------------------------------------------------
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-------------------------------------------------------
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-------------------------------------------------------
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-------------------------------------------------------
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-------------------------------------------------------
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-------------------------------------------------------
Mean ROC AUC: nan
Mean F1: nan
Mean Precision: nan
Mean Recall: nan
Mean Accuracy: nan

So I think I'm using the ADTree classifier incorrectly as the bagging base.

Is there any way to do this correctly?
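One thing worth knowing here: by default, cross_val_score maps any exception raised while fitting or scoring a fold to NaN (error_score=np.nan), so a wall of NaN usually hides a real error, such as the estimator not implementing the scikit-learn interface. A minimal, self-contained sketch of the debugging idiom, using a synthetic dataset and a scikit-learn tree as stand-ins for the original data and the WEKA classifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for X_data_train / y_data_train (assumption: a small
# imbalanced binary problem, like the 161/34 split above).
X, y = make_classification(n_samples=195, weights=[0.8], random_state=1)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                          random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# error_score="raise" re-raises the first per-fold exception instead of
# silently recording NaN, so you see *why* every score came back NaN.
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv,
                         error_score="raise")
print(len(scores))  # 30 scores: 10 splits x 3 repeats
```

With the pyweka Classifier plugged in directly, this idiom would surface the underlying exception rather than 30 NaNs.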


Comments (1)

一曲琵琶半遮面シ 2025-01-25 02:05:20


I've just released version 0.0.5 of sklearn-weka-plugin, with which you can do the following:

import os
from statistics import mean

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score

import sklweka.jvm as jvm
from sklweka.classifiers import WekaEstimator
from sklweka.dataset import load_arff

jvm.start(packages=True)

# adjust the path to your dataset
# the example assumes all attributes and class to be nominal
data_file = "/some/where/vote.arff"
X, y, meta = load_arff(data_file, class_index="last")

base_model_1 = WekaEstimator(classname="weka.classifiers.trees.ADTree",
                             options=["-B", "10", "-E", "-3", "-S", "1"],
                             nominal_input_vars="first-last",  # which attributes need to be treated as nominal
                             nominal_output_var=True)          # class is nominal as well
model = BaggingClassifier(base_estimator=base_model_1, n_estimators=100, n_jobs=1, random_state=1)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
accuracy_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=None)  # single process!
print("-------------------------------------------------------")
print(accuracy_scores)
print("-------------------------------------------------------")
print('Mean Accuracy: %.3f' % mean(accuracy_scores))

jvm.stop()

This generates the following output:

-------------------------------------------------------
[0.97727273 0.95454545 0.95454545 0.95454545 0.97727273 0.90697674
 1.         0.90697674 0.95348837 0.95348837 0.97727273 0.95454545
 0.90909091 0.88636364 0.97727273 0.97674419 0.97674419 0.97674419
 0.97674419 0.97674419 0.93181818 0.97727273 0.93181818 0.90909091
 1.         1.         1.         0.90697674 0.97674419 0.95348837]
-------------------------------------------------------
Mean Accuracy: 0.957

Please note that you might get an exception like "object has no attribute 'decision_function'" when trying to generate other metrics. This article might help with that.

Finally, a limitation due to using a JVM and python-javabridge in the background is that you cannot fork processes and distribute jobs across your cores (n_jobs=None).
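If a scorer insists on decision_function, one possible workaround (a sketch with stand-in data and a stand-in estimator, not sklearn-weka-plugin specifics) is to collect plain out-of-fold predictions with cross_val_predict, which only needs predict(), and compute the metrics once over those. Note that cross_val_predict requires each sample to land in exactly one test fold, so a plain StratifiedKFold is used instead of the repeated variant:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and estimator (assumptions); with a WekaEstimator base,
# keep n_jobs=None as noted above.
X, y = make_classification(n_samples=195, weights=[0.8], random_state=1)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                          random_state=1)

# cross_val_predict needs only predict(), so no decision_function
# (or predict_proba) is required for these metrics.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
y_pred = cross_val_predict(model, X, y, cv=cv, n_jobs=None)

print('Accuracy:  %.3f' % accuracy_score(y, y_pred))
print('F1:        %.3f' % f1_score(y, y_pred))
print('Precision: %.3f' % precision_score(y, y_pred))
print('Recall:    %.3f' % recall_score(y, y_pred))
```

Metrics computed this way are pooled over all folds rather than averaged per fold, so they can differ slightly from cross_val_score means.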
