Sklearn Naive Bayes with multiple features

Posted on 2025-01-24 15:31:45

Background

I'm struggling to implement a Naive Bayes classifier in python with sklearn across multiple features.

The features I have are:

  1. Title - some short text
  2. Description - some longer text
  3. Timestamp - a float representing an hour of the day (e.g. 18.0 = 6:00PM, 11.5 = 11:30AM)

The labels/classes are categorical strings: e.g. "Class1", "Class2", "Class3"

Aim

My goal is to build a Naive Bayes classifier that uses all 3 features to predict the class label. I specifically wish to use all of the features at the same time, i.e. not simply the description feature.

Initial Approach

I have set up some pre-processing pipelines using sklearn as follows:

from sklearn import preprocessing, naive_bayes, feature_extraction, pipeline, model_selection, compose

text_columns = ['title', 'description']
time_columns = ['timestamp']

# get an 80-20 test-train split
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)

# convert the text data into vectors
text_pipeline = pipeline.Pipeline([
    ('vect', feature_extraction.text.CountVectorizer()),
    ('tfidf', feature_extraction.text.TfidfTransformer()),
])

# preprocess by scaling the data, and binning the data
time_pipeline = pipeline.Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('bin', preprocessing.KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='quantile')),
])

# combine the pre-processors
preprocessor = compose.ColumnTransformer([
    ('text', text_pipeline, text_columns),
    ('time', time_pipeline, time_columns),
])

clf = pipeline.Pipeline([
    ('preprocessor', preprocessor),
    ('clf', naive_bayes.MultinomialNB()),
])

Here train is a pandas dataframe with the features and labels, read straight from a .csv file like this:

ID,title,description,timestamp,class
1,First Title String,"A description of the first title",13.0,Class1
2,Second Title String,"A description of the second title",17.5,Class2

Also note that I'm not setting most of the params for the transformers/classifiers, as I want to use a grid-search to find the optimum ones later on.

The problem

When I call clf.fit(X_train, y_train), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_7500/3039541201.py in <module>
     33 
     34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
     36 # # print the number of features
     37 

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    388         """
    389         fit_params_steps = self._check_fit_params(**fit_params)
--> 390         Xt = self._fit(X, y, **fit_params_steps)
    391         with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    392             if self._final_estimator != "passthrough":

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    346                 cloned_transformer = clone(transformer)
    347             # Fit or load from cache the current transformer
--> 348             X, fitted_transformer = fit_transform_one_cached(
    349                 cloned_transformer,
    350                 X,

~/.local/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    347 
    348     def __call__(self, *args, **kwargs):
--> 349         return self.func(*args, **kwargs)
    350 
    351     def call_and_shelve(self, *args, **kwargs):

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891     with _print_elapsed_time(message_clsname, message):
    892         if hasattr(transformer, "fit_transform"):
--> 893             res = transformer.fit_transform(X, y, **fit_params)
    894         else:
    895             res = transformer.fit(X, y, **fit_params).transform(X)

~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    697         self._record_output_indices(Xs)
    698 
--> 699         return self._hstack(list(Xs))
    700 
    701     def transform(self, X):

~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in _hstack(self, Xs)
    789         else:
    790             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 791             return np.hstack(Xs)
    792 
    793     def _sk_visual_block_(self):

<__array_function__ internals> in hstack(*args, **kwargs)

~/.local/lib/python3.9/site-packages/numpy/core/shape_base.py in hstack(tup)
    344         return _nx.concatenate(arrs, 0)
    345     else:
--> 346         return _nx.concatenate(arrs, 1)
    347 
    348 

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2 and the array at index 1 has size 3001

I have the following shapes for X_train and y_train:

X_train: (3001, 3)
y_train: (3001,)

Steps Taken

Individual Features

I can use the same pipelines with individual features (by altering the text_columns and time_columns lists), and get a perfectly fine classifier, e.g. using only the "title" field, or only the "timestamp" field. Unfortunately, these individual features are not accurate enough, so I would like to use all the features to build a more accurate classifier. The issue seems to arise when I attempt to combine more than one feature.

I'm open to potentially using multiple Naive Bayes classifiers, and trying to multiply the probabilities together to get some overall probability, but I honestly have no clue how to do that, and I'm sure I'm just missing something simple here.
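For reference, one way to read "multiply the probabilities together" is sketched below. This is only a sketch under assumptions that are not in the post: the `toy` DataFrame is hypothetical, and `GaussianNB` is swapped in for the timestamp (the question itself bins the timestamp and uses `MultinomialNB`). Each classifier's posterior already includes the class prior, so when two log-posteriors are added, one log-prior is subtracted to avoid counting it twice.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data laid out like the question's CSV.
toy = pd.DataFrame({
    "title": ["morning news", "evening show", "morning brief", "evening film"],
    "timestamp": [8.0, 20.0, 9.5, 21.0],
    "class": ["Class1", "Class2", "Class1", "Class2"],
})
y = toy["class"]

# One Naive Bayes model per feature (GaussianNB for the float is an assumption).
text_nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(toy["title"], y)
time_nb = GaussianNB().fit(toy[["timestamp"]], y)

# log P(c | title) + log P(c | time) - log P(c), i.e. the prior is counted once.
log_prior = text_nb[-1].class_log_prior_
joint = (
    text_nb.predict_log_proba(toy["title"])
    + time_nb.predict_log_proba(toy[["timestamp"]])
    - log_prior
)

# Renormalise each row so the combined scores are log-probabilities again.
joint -= np.logaddexp.reduce(joint, axis=1, keepdims=True)
pred = text_nb.classes_[joint.argmax(axis=1)]
```

Both models sort their classes the same way, so the columns of the two `predict_log_proba` outputs line up. A single `ColumnTransformer` feeding one classifier is usually simpler than this, so the sketch is mainly useful when the features need different Naive Bayes variants.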

Dropping the Time Features

I have tried running only the text_columns, i.e. "title" and "description", and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_7500/1900884535.py in <module>
     33 
     34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
     36 # # print the number of features
     37 

~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    392             if self._final_estimator != "passthrough":
    393                 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 394                 self._final_estimator.fit(Xt, y, **fit_params_last_step)
    395 
    396         return self

~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    661             Returns the instance itself.
    662         """
--> 663         X, y = self._check_X_y(X, y)
    664         _, n_features = X.shape
    665 

~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in _check_X_y(self, X, y, reset)
    521     def _check_X_y(self, X, y, reset=True):
    522         """Validate X and y in fit methods."""
--> 523         return self._validate_data(X, y, accept_sparse="csr", reset=reset)
    524 
    525     def _update_class_log_prior(self, class_prior=None):

~/.local/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    979     y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
    980 
--> 981     check_consistent_length(X, y)
    982 
    983     return X, y

~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    330     uniques = np.unique(lengths)
    331     if len(uniques) > 1:
--> 332         raise ValueError(
    333             "Found input variables with inconsistent numbers of samples: %r"
    334             % [int(l) for l in lengths]

ValueError: Found input variables with inconsistent numbers of samples: [2, 3001]

And I have the following shapes:

X_train: (3001, 2)
y_train: (3001,)

Reshaping the Labels

I have also tried reshaping the y_train variable by selecting the label column with double brackets (train[['class']]), like so:

# new
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train[['class']], test_size=0.2, random_state=RANDOM_STATE)

# previous
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)

so that the resultant shapes are:

X_train: (3001, 3)
y_train: (3001, 1)

Unfortunately, this doesn't fix the error.

Removing Naive Bayes Classifier

When I remove the final step of the pipeline (the naive_bayes.MultinomialNB()), and I remove the time_columns (the "timestamp" feature), then I can build a pre-processor that works just fine for the text. I.e. I can pre-process the text fields ("title", "description"), but when I add the classifier, I get the error above under "Dropping the Time Features".


Comments (1)

守望孤独 2025-01-31 15:31:45

When vectorizing multiple text features, you should create CountVectorizer (or TfidfVectorizer) instances for every feature:

title_pipeline = pipeline.Pipeline([
    ('vect', feature_extraction.text.CountVectorizer()),
    ('tfidf', feature_extraction.text.TfidfTransformer()),
])
description_pipeline = pipeline.Pipeline([
    ('vect', feature_extraction.text.CountVectorizer()),
    ('tfidf', feature_extraction.text.TfidfTransformer()),
])
preprocessor = compose.ColumnTransformer([
    ('title', title_pipeline, text_columns[0]),
    ('description', description_pipeline, text_columns[1]),
    ('time', time_pipeline, time_columns),
])
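
To check that this layout fits end to end, here is a minimal sketch under some assumptions: the six-row `toy` DataFrame is hypothetical, `n_bins` is reduced to 3 to suit such a small sample, and a plain `CountVectorizer` per column is used in place of the two-step text pipelines to keep it short.

```python
import pandas as pd
from sklearn import compose, feature_extraction, naive_bayes, pipeline, preprocessing

# Hypothetical toy data laid out like the question's CSV.
toy = pd.DataFrame({
    "title": ["First Title String", "Second Title String", "Third Title Text",
              "Fourth Title Text", "Fifth Title String", "Sixth Title Text"],
    "description": ["A description of the first title", "A description of the second title",
                    "Some words about the third title", "Some words about the fourth title",
                    "A description of the fifth title", "Some words about the sixth title"],
    "timestamp": [13.0, 17.5, 9.0, 21.0, 11.5, 18.0],
    "class": ["Class1", "Class2", "Class1", "Class2", "Class1", "Class2"],
})
text_columns = ["title", "description"]
time_columns = ["timestamp"]
X, y = toy[text_columns + time_columns], toy["class"]

# Same time pipeline as in the question, with fewer bins for the tiny sample.
time_pipeline = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("bin", preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")),
])

# One vectorizer per text column; a string selector passes a 1-D Series,
# while the list selector for the timestamp passes a 2-D frame.
preprocessor = compose.ColumnTransformer([
    ("title", feature_extraction.text.CountVectorizer(), text_columns[0]),
    ("description", feature_extraction.text.CountVectorizer(), text_columns[1]),
    ("time", time_pipeline, time_columns),
])

clf = pipeline.Pipeline([
    ("preprocessor", preprocessor),
    ("clf", naive_bayes.MultinomialNB()),
])
clf.fit(X, y)

Xt = clf.named_steps["preprocessor"].transform(X)
predictions = clf.predict(X)
print(Xt.shape, predictions)
```

The transformed matrix has one row per sample, so the row-count mismatch from the question should no longer occur.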

P.S. The combination of CountVectorizer and TfidfTransformer is equivalent to TfidfVectorizer. Also, you may just skip tf-idf weighting and use only CountVectorizer for MultinomialNB.
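
The equivalence mentioned in the P.S. can be checked with a short sketch. The `docs` list is a hypothetical example, and both routes use default parameters.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Hypothetical documents, used only for the comparison.
docs = [
    "A description of the first title",
    "A description of the second title",
    "Something else entirely",
]

# Two-step route used in the answer: raw counts, then tf-idf weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)

# One-step route: TfidfVectorizer performs both steps internally.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

With default settings the two matrices should match, so either route can be dropped into the per-column transformers above.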
