
Sklearn Naive Bayes with multiple features

Posted on 2025-01-24 15:31:45

Background

I'm struggling to implement a Naive Bayes classifier in Python with sklearn across multiple features.

The features I have are:

Title - some short text
Description - some longer text
Timestamp - a float representing an hour of the day (e.g. 18.0 = 6:00PM, 11.5 = 11:30AM)

The labels/classes are categorical strings: e.g. "Class1", "Class2", "Class3"

Aim

My goal is to use all 3 features to construct a Naive Bayes classifier that predicts the class label. I specifically wish to use all of the features at the same time, i.e. not simply the description feature.

Initial Approach

I have set up some pre-processing pipelines using sklearn as follows:

from sklearn import preprocessing, naive_bayes, feature_extraction, pipeline, model_selection, compose

text_columns = ['title', 'description']
time_columns = ['timestamp']

# get an 80-20 test-train split
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)

# convert the text data into vectors
text_pipeline = pipeline.Pipeline([
    ('vect', feature_extraction.text.CountVectorizer()),
    ('tfidf', feature_extraction.text.TfidfTransformer()),
])

# preprocess by scaling the data, and binning the data
time_pipeline = pipeline.Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('bin', preprocessing.KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='quantile')),
])

# combine the pre-processors
preprocessor = compose.ColumnTransformer([
    ('text', text_pipeline, text_columns),
    ('time', time_pipeline, time_columns),
])

clf = pipeline.Pipeline([
    ('preprocessor', preprocessor),
    ('clf', naive_bayes.MultinomialNB()),
])

Here train is a pandas dataframe with the features and labels, read straight from a .csv file like this:

ID,title,description,timestamp,class
1,First Title String,"A description of the first title",13.0,Class1
2,Second Title String,"A description of the second title",17.5,Class2

Also note that I'm not setting most of the params for the transformers/classifiers, as I want to use a grid-search to find the optimum ones later on.

The problem

When I call clf.fit(X_train, y_train), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_7500/3039541201.py in <module>
     33 
     34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
     36 # # print the number of features
     37 

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    388         """
    389         fit_params_steps = self._check_fit_params(**fit_params)
--> 390         Xt = self._fit(X, y, **fit_params_steps)
    391         with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    392             if self._final_estimator != "passthrough":

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    346                 cloned_transformer = clone(transformer)
    347             # Fit or load from cache the current transformer
--> 348             X, fitted_transformer = fit_transform_one_cached(
    349                 cloned_transformer,
    350                 X,

~/.local/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    347 
    348     def __call__(self, *args, **kwargs):
--> 349         return self.func(*args, **kwargs)
    350 
    351     def call_and_shelve(self, *args, **kwargs):

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891     with _print_elapsed_time(message_clsname, message):
    892         if hasattr(transformer, "fit_transform"):
--> 893             res = transformer.fit_transform(X, y, **fit_params)
    894         else:
    895             res = transformer.fit(X, y, **fit_params).transform(X)

~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    697         self._record_output_indices(Xs)
    698 
--> 699         return self._hstack(list(Xs))
    700 
    701     def transform(self, X):

~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in _hstack(self, Xs)
    789         else:
    790             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 791             return np.hstack(Xs)
    792 
    793     def _sk_visual_block_(self):

<__array_function__ internals> in hstack(*args, **kwargs)

~/.local/lib/python3.9/site-packages/numpy/core/shape_base.py in hstack(tup)
    344         return _nx.concatenate(arrs, 0)
    345     else:
--> 346         return _nx.concatenate(arrs, 1)
    347 
    348 

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2 and the array at index 1 has size 3001

I have the following shapes for X_train and y_train:

X_train: (3001, 3)
y_train: (3001,)

Steps Taken

Individual Features

I can use the same pipelines with individual features (by altering the text_columns and time_columns lists), and get a perfectly fine classifier, e.g. using only the "title" field, or only the "timestamp" field. Unfortunately, these individual features are not accurate enough, so I would like to use all the features to build a more accurate classifier. The issue seems to arise when I attempt to combine more than one feature.

I'm open to potentially using multiple Naive Bayes classifiers, and trying to multiply the probabilities together to get some overall probability, but I honestly have no clue how to do that, and I'm sure I'm just missing something simple here.
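
The "multiply the probabilities together" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data is made up, the variable names are hypothetical, and using GaussianNB for the timestamp is a choice made here, not something from the setup above. If the three features are treated as conditionally independent given the class, the combined posterior is proportional to the product of the per-feature posteriors divided by the class prior once for each extra model, which in log space is a sum minus (k - 1) times the log prior.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data with the same columns as the question's CSV.
train = pd.DataFrame({
    "title": ["cheap flights", "team meeting", "cheap hotel",
              "project meeting", "flight deals", "weekly meeting"],
    "description": ["book cheap flights today", "agenda for the team meeting",
                    "hotel discount this week", "notes from the project meeting",
                    "deals on flights and hotels", "agenda for the weekly meeting"],
    "timestamp": [18.0, 9.5, 20.0, 10.0, 19.5, 11.5],
    "class": ["Class1", "Class2", "Class1", "Class2", "Class1", "Class2"],
})
y = train["class"]

# One Naive Bayes model per feature; each text model sees a 1-D Series.
title_nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train["title"], y)
desc_nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train["description"], y)
time_nb = GaussianNB().fit(train[["timestamp"]], y)  # assumption: Gaussian hour-of-day

# All three models were fit on the same y, so their classes_ share one order.
log_prior = np.log(time_nb.class_prior_)

# Sum the three log-posteriors and remove the prior that was counted two extra times.
log_post = (
    title_nb.predict_log_proba(train["title"])
    + desc_nb.predict_log_proba(train["description"])
    + time_nb.predict_log_proba(train[["timestamp"]])
    - 2 * log_prior
)
# Renormalise so each row is a proper log-probability distribution.
log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)

combined = time_nb.classes_[log_post.argmax(axis=1)]
print(combined)
```

A single pipeline over all columns is usually simpler; this variant mainly helps when different features call for different Naive Bayes likelihoods.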

Dropping the Time Features

I have tried running only the text columns, i.e. "title" and "description", and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_7500/1900884535.py in <module>
     33 
     34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
     36 # # print the number of features
     37 

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    392             if self._final_estimator != "passthrough":
    393                 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 394                 self._final_estimator.fit(Xt, y, **fit_params_last_step)
    395 
    396         return self

~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    661             Returns the instance itself.
    662         """
--> 663         X, y = self._check_X_y(X, y)
    664         _, n_features = X.shape
    665 

~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in _check_X_y(self, X, y, reset)
    521     def _check_X_y(self, X, y, reset=True):
    522         """Validate X and y in fit methods."""
--> 523         return self._validate_data(X, y, accept_sparse="csr", reset=reset)
    524 
    525     def _update_class_log_prior(self, class_prior=None):

~/.local/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    979     y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
    980 
--> 981     check_consistent_length(X, y)
    982 
    983     return X, y

~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    330     uniques = np.unique(lengths)
    331     if len(uniques) > 1:
--> 332         raise ValueError(
    333             "Found input variables with inconsistent numbers of samples: %r"
    334             % [int(l) for l in lengths]

ValueError: Found input variables with inconsistent numbers of samples: [2, 3001]

And I have the following shapes:

X_train: (3001, 2)
y_train: (3001,)

Reshaping the Labels

I have also tried reshaping the y_train variable by selecting the label column with double brackets ([['class']]), like so:

# new
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train[['class']], test_size=0.2, random_state=RANDOM_STATE)

# previous
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)

so that the resultant shapes are:

X_train: (3001, 3)
y_train: (3001, 1)

But unfortunately this doesn't appear to fix this.

Removing Naive Bayes Classifier

When I remove the final step of the pipeline (the naive_bayes.MultinomialNB()), and I remove the time_columns (the "timestamp" feature), then I can build a pre-processor that works just fine for the text. I.e. I can pre-process the text fields ("title", "description"), but when I add the classifier, I get the error above under "Dropping the Time Features".

守望孤独 2025-01-31 15:31:45

When vectorizing multiple text features, you should create CountVectorizer (or TfidfVectorizer) instances for every feature:

title_pipeline = pipeline.Pipeline([
    ('vect', feature_extraction.text.CountVectorizer()),
    ('tfidf', feature_extraction.text.TfidfTransformer()),
])
description_pipeline = pipeline.Pipeline([
    ('vect', feature_extraction.text.CountVectorizer()),
    ('tfidf', feature_extraction.text.TfidfTransformer()),
])
preprocessor = compose.ColumnTransformer([
    ('title', title_pipeline, text_columns[0]),
    ('description', description_pipeline, text_columns[1]),
    ('time', time_pipeline, time_columns),
])
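
The size of 2 in the error messages can be reproduced in isolation. The sketch below uses made-up toy data and is only meant to illustrate the likely cause: CountVectorizer expects a 1-D iterable of strings, and iterating over a DataFrame yields its column names, so a two-column selection looks like two documents. ColumnTransformer passes a DataFrame when the column selector is a list and a 1-D Series when it is a single string, which is why the per-column version above works.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the question's data (values are made up).
X = pd.DataFrame({
    "title": ["first title", "second title", "third title"],
    "description": ["a longer text", "another longer text", "more text here"],
})

# Iterating over a DataFrame yields its column names, so the vectorizer
# treats "title" and "description" as the only two documents.
two_cols = CountVectorizer().fit_transform(X[["title", "description"]])
print(two_cols.shape)  # 2 rows, one per column name

# A single column (a 1-D Series) yields one row per sample.
one_col = CountVectorizer().fit_transform(X["title"])
print(one_col.shape)  # 3 rows, one per sample
```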

P.S. The combination of CountVectorizer and TfidfTransformer is equivalent to TfidfVectorizer. Also, you may just skip tf-idf weighting and use only CountVectorizer for MultinomialNB.
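
Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes made-up toy data in place of the real CSV, skips tf-idf as suggested in the P.S., and uses n_bins=3 only because the toy set is tiny; none of these values come from the original post.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Made-up toy data with the same columns as the question's CSV.
train = pd.DataFrame({
    "title": ["cheap flights", "team meeting", "cheap hotel",
              "project meeting", "flight deals", "weekly meeting"],
    "description": ["book cheap flights today", "agenda for the team meeting",
                    "hotel discount this week", "notes from the project meeting",
                    "deals on flights and hotels", "agenda for the weekly meeting"],
    "timestamp": [18.0, 9.5, 20.0, 10.0, 19.5, 11.5],
    "class": ["Class1", "Class2", "Class1", "Class2", "Class1", "Class2"],
})
feature_columns = ["title", "description", "timestamp"]

# Same scaling-then-binning idea as in the question; the ordinal bin
# indices are non-negative, which MultinomialNB requires.
time_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("bin", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")),
])

# Text columns are selected with a plain string so that each vectorizer
# receives a 1-D Series; the numeric column is selected with a list so
# that the scaler receives a 2-D frame.
preprocessor = ColumnTransformer([
    ("title", CountVectorizer(), "title"),
    ("description", CountVectorizer(), "description"),
    ("time", time_pipeline, ["timestamp"]),
])

clf = Pipeline([
    ("preprocessor", preprocessor),
    ("clf", MultinomialNB()),
])

clf.fit(train[feature_columns], train["class"])
predictions = clf.predict(train[feature_columns])
print(predictions)
```

With this layout, grid-search parameter names follow the step names, for example preprocessor__title__ngram_range or clf__alpha.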
