How to use a trained model to classify new data - Python text classification (NLTK and Scikit)
I am very new to text classification and I am trying to classify each line of a dataset composed of Twitter comments according to some pre-defined topics.
I have used the code below in a Jupyter Notebook to build and train a model with a training dataset. I chose a supervised approach in Python with NLTK and Scikit-learn, as unsupervised ones (like LDA) were not giving me good results.
I have followed these steps so far:
- Manually categorised the topics of a training dataset;
- Applied the training dataset to the code below and trained it, resulting in an accuracy of approx. 82%.
Now, I want to use this model to automatically categorise the topics of another dataset (i.e., my test dataset). Most posts only cover the training part, so it is quite frustrating for a newcomer to understand how to get the trained model and actually use it.
Hence, the question is: with the code below, how can I now use the trained model to classify a new dataset?
I appreciate your help.
This tutorial is very good, and I used it as a reference for the code below: https://medium.com/@ishan16.d/text-classification-in-python-with-scikit-learn-and-nltk-891aa2d0ac4b
My model building and training code:
#Do library and methods import
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.tokenize import RegexpTokenizer
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk as nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import regex as re
import requests
# Import dataset
df = pd.read_csv(r'C:\Users\user_name\Downloads\Train_data.csv', delimiter=';')
# Tokenize
def tokenize(x):
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(x)
df['tokens'] = df['Tweet'].map(tokenize)
# Stem and Lemmatize
nltk.download('wordnet')
nltk.download('omw-1.4')
def stemmer(x):
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in x])
def lemmatize(x):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in x])
df['lemma'] = df['tokens'].map(lemmatize)
df['stems'] = df['tokens'].map(stemmer)
# set up feature matrix and target column
X = df['lemma']
y = df['Topic']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 13)
# Create our pipeline with a vectorizer and our naive Bayes classifier
pipe_mnnb = Pipeline(steps = [('tf', TfidfVectorizer()), ('mnnb', MultinomialNB())])
# Create parameter grid
pgrid_mnnb = {
    'tf__max_features': [1000, 2000, 3000],
    'tf__stop_words': ['english', None],
    'tf__ngram_range': [(1,1), (1,2)],
    'tf__use_idf': [True, False],
    'mnnb__alpha': [0.1, 0.5, 1]
}
# Set up the grid search and fit the model
gs_mnnb = GridSearchCV(pipe_mnnb,pgrid_mnnb,cv=5,n_jobs=-1)
gs_mnnb.fit(X_train, y_train)
# Check the score
gs_mnnb.score(X_train, y_train)
gs_mnnb.score(X_test, y_test)
# Check the parameters
gs_mnnb.best_params_
# Get predictions
preds_mnnb = gs_mnnb.predict(X)
df['preds'] = preds_mnnb
# Print resulting dataset
print(df.shape)
df.head()
1 Answer
It seems that after training you just have to do the same as in your validation step, using the grid-searcher directly; in the sklearn library it is also used after training as a model with the best hyperparameters found.
So take an X which is whatever you want to evaluate and run:
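Presumably something along the lines of the last prediction step in your own code (the names below are yours, nothing new):
preds_mnnb = gs_mnnb.predict(X)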
preds_mnnb should contain what you expect
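To make this concrete for a whole new file, here is a minimal sketch of applying the fitted grid search to a separate, unlabelled dataset. It assumes the new CSV is called New_data.csv, uses the same ';' delimiter and has a Tweet column; those names are illustrative, so adjust them to your data:
# Load the new, unlabelled dataset and repeat the same preprocessing used for training
new_df = pd.read_csv(r'C:\Users\user_name\Downloads\New_data.csv', delimiter=';')  # illustrative file name
new_df['tokens'] = new_df['Tweet'].map(tokenize)
new_df['lemma'] = new_df['tokens'].map(lemmatize)
# The fitted GridSearchCV wraps the whole pipeline (TF-IDF vectorizer + best MultinomialNB),
# so predict() can take the raw lemmatized strings directly
new_df['preds'] = gs_mnnb.predict(new_df['lemma'])
new_df.head()
Because the vectorizer lives inside the pipeline, you do not need to re-fit or re-vectorize anything yourself; the grid-searcher applies the same TF-IDF transformation it learned during training before predicting.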