How to find the optimal number of topics for LDA with scikit-learn?

Posted on 2025-01-19 09:18:01


I'm computing topic models through scikit-learn with this script (I start from a DataFrame "df" that has one document per row in the column "Text"):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)),
                             token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation (LDA accepts the sparse matrix directly)
tf = vectorizer.fit_transform(df.Text)

# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)

I'd like to compare models with different numbers of topics (say, from 2 to 20) using a coherence measure. How can I do that?
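scikit-learn itself does not ship a topic-coherence metric; the usual options are gensim's `CoherenceModel` or computing a coherence score directly from the document-term matrix. Below is a minimal sketch, assuming the latter approach: it implements UMass coherence by hand (the `umass_coherence` helper, the toy corpus standing in for `df.Text`, and `top_n=3` are all illustrative choices, not part of the original question), then sweeps over candidate topic counts and keeps the score for each.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def umass_coherence(topic_word, binary_dtm, top_n=10):
    """Mean UMass coherence over all topics.

    topic_word: model.components_, shape (n_topics, n_words)
    binary_dtm: 0/1 document-term matrix, shape (n_docs, n_words)
    """
    doc_freq = binary_dtm.sum(axis=0)        # D(w): number of docs containing w
    co_doc = binary_dtm.T @ binary_dtm       # D(w_i, w_j): co-occurrence counts
    scores = []
    for topic in topic_word:
        top = np.argsort(topic)[::-1][:top_n]   # indices of the top_n topic words
        s = 0.0
        for i in range(1, len(top)):
            for j in range(i):
                # UMass: log((D(w_i, w_j) + 1) / D(w_j)) over ranked word pairs
                s += np.log((co_doc[top[i], top[j]] + 1.0) / doc_freq[top[j]])
        scores.append(s)
    return float(np.mean(scores))

# Toy corpus standing in for df.Text
docs = ["apple banana fruit juice", "banana fruit smoothie apple",
        "dog cat pet food", "cat pet animal dog"]
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)
binary_dtm = (tf > 0).astype(int).toarray()

# Sweep over candidate topic counts and score each model
coherences = {}
for k in range(2, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(tf)
    coherences[k] = umass_coherence(lda.components_, binary_dtm, top_n=3)

best_k = max(coherences, key=coherences.get)
print(coherences)
```

With a real corpus you would widen the range to `range(2, 21)` as in the question; note that UMass scores are typically negative, with values closer to zero indicating more coherent topics, so picking the maximum is the right selection rule.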
