Active learning with modAL - invalid shape
I am trying to implement active learning in Python. My classification problem currently takes Word2vec vector representations and feeds them into a Random Forest.
I have a tiny initial training dataset, and I would like to use the modAL package to grow it through active learning.
Here is what I've tried so far:
import numpy as np
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_train0, y_training=y_train
)
test = test.reset_index()
for i in range(20):
    query_idx, query_instance = learner.query(X_test0)
    y_new = input('Classify:')
    y_new = np.array([y_new])
    # This is the call that raises the ValueError quoted below
    learner.teach(np.array(X_test0[query_idx].reshape(-1, 1)), y_new)
Here X_test0 is a pandas DataFrame with shape 1056 x 100 (i.e. 1056 examples with 100 features each, which are the Word2vec representations). I treat it as unlabelled so that I can check performance later. Similarly, y_train is another pandas DataFrame containing the binary labels (0s or 1s) for the training data.
My issue is that I want modAL to understand that each example has multiple features, so that every 100-length vector gets exactly one label. In the example above, the following error appears:
ValueError: Found input variables with inconsistent numbers of samples: [100, 1]
It seems to me that it is not understanding that those 100 features correspond to only one label...
Any clue on how to solve it?
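For reference, the sample-count mismatch in that error can be reproduced with NumPy alone. The sketch below assumes the queried example is a flat 100-element vector; the names row, as_column, as_row and label are illustrative and do not come from the code above.

```python
import numpy as np

# Hypothetical stand-in for one queried Word2vec row: 1 example, 100 features.
row = np.arange(100, dtype=float)

# reshape(-1, 1) turns the row into 100 "samples" with 1 feature each.
as_column = row.reshape(-1, 1)

# reshape(1, -1) keeps it as 1 sample with 100 features.
as_row = row.reshape(1, -1)

# A single label, as built by np.array([y_new]) in the loop.
label = np.array([1])

print(as_column.shape, as_row.shape, label.shape)
```

scikit-learn style estimators compare the first dimension of X with the length of y, so the (100, 1) array paired with one label would be reported as [100, 1] samples, while the (1, 100) array matches the single label.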
EDIT: I thought the problem might be the reshape call. Since teach seems to want an array as input, I also tried changing the last line as follows:
learner.teach(X_test0.iloc[query_idx].values, np.array(y_new))
which now produces the following error:
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
Removing .values so that a DataFrame is passed instead also produces an error:
TypeError: <class 'pandas.core.series.Series'> datatype is not supported
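One possible way around both TypeErrors, offered as a sketch rather than a confirmed fix: the messages suggest that pandas and NumPy objects get mixed when the new sample is appended to the existing training data, so keeping everything as NumPy arrays may avoid them. The helper below is hypothetical (label_pool and ask_label are names made up for illustration). It only assumes a learner exposing modAL-style query(X) and teach(X, y) methods, and that each query returns a single index.

```python
import numpy as np


def label_pool(learner, X_pool, ask_label, n_queries=20):
    """Query `learner` up to n_queries times, teaching one labelled row each time.

    `learner` is assumed to expose modAL-style query(X) and teach(X, y).
    `ask_label` is a hypothetical callback that returns 0 or 1 for a row.
    Returns the rows of the pool that were never queried.
    """
    # np.asarray also accepts a numeric DataFrame, so the pool becomes
    # a plain 2-D array of shape (n_examples, n_features).
    X_pool = np.asarray(X_pool, dtype=float)

    for _ in range(min(n_queries, len(X_pool))):
        query_idx, _ = learner.query(X_pool)
        idx = np.atleast_1d(query_idx)

        # Indexing with a 1-element index array keeps the row 2-D: (1, n_features).
        X_new = X_pool[idx]

        # input() returns a string, so cast the answer to int; shape is (1,).
        y_new = np.array([int(ask_label(X_new[0]))])

        learner.teach(X_new, y_new)

        # Drop the labelled row so it is not queried again.
        X_pool = np.delete(X_pool, idx, axis=0)

    return X_pool
```

With modAL this would presumably be called with a learner that was also built from NumPy inputs, for example X_training=X_train0.to_numpy() and y_training=y_train.to_numpy().ravel(), and with ask_label=lambda row: input('Classify: ').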