How to find the optimal number of topics for LDA with scikit-learn?

Posted on 2025-01-19 09:18:01


I'm computing topic models through scikit-learn with this script (I start from a DataFrame "df" that has one document per row in the column "Text"):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)),
                             token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation (LDA accepts the sparse matrix directly)
tf = vectorizer.fit_transform(df.Text)

# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)

I'd like to compare models with different numbers of topics (say, from 2 to 20) using a coherence measure. How can I do that?
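scikit-learn itself does not ship a topic-coherence metric; the usual options are gensim's `CoherenceModel` or computing a coherence score directly from the document-term matrix. Below is a minimal sketch, assuming the latter approach: it implements UMass coherence by hand (the `umass_coherence` helper, the toy corpus standing in for `df.Text`, and `top_n=3` are all illustrative choices, not part of the original question), then sweeps over candidate topic counts and keeps the score for each.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def umass_coherence(topic_word, binary_dtm, top_n=10):
    """Mean UMass coherence over all topics.

    topic_word: model.components_, shape (n_topics, n_words)
    binary_dtm: 0/1 document-term matrix, shape (n_docs, n_words)
    """
    doc_freq = binary_dtm.sum(axis=0)        # D(w): number of docs containing w
    co_doc = binary_dtm.T @ binary_dtm       # D(w_i, w_j): co-occurrence counts
    scores = []
    for topic in topic_word:
        top = np.argsort(topic)[::-1][:top_n]   # indices of the top_n topic words
        s = 0.0
        for i in range(1, len(top)):
            for j in range(i):
                # UMass: log((D(w_i, w_j) + 1) / D(w_j)) over ranked word pairs
                s += np.log((co_doc[top[i], top[j]] + 1.0) / doc_freq[top[j]])
        scores.append(s)
    return float(np.mean(scores))

# Toy corpus standing in for df.Text
docs = ["apple banana fruit juice", "banana fruit smoothie apple",
        "dog cat pet food", "cat pet animal dog"]
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)
binary_dtm = (tf > 0).astype(int).toarray()

# Sweep over candidate topic counts and score each model
coherences = {}
for k in range(2, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(tf)
    coherences[k] = umass_coherence(lda.components_, binary_dtm, top_n=3)

best_k = max(coherences, key=coherences.get)
print(coherences)
```

With a real corpus you would widen the range to `range(2, 21)` as in the question; note that UMass scores are typically negative, with values closer to zero indicating more coherent topics, so picking the maximum is the right selection rule.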
