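[LDA] Why do the document-topic probabilities computed by gensim change after converting them to a list?
1. Problem
I am running LDA with the gensim package on Python 2.7 and storing each document's topic probabilities in the variable corpus_lda, as in the following code: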
corpus_lda = lda[corpus_tfidf]
print 'type(corpus_lda) = ', type(corpus_lda)
lcorpus_tfidf = list(corpus_lda)
for i in range(len(corpus_lda)):
    print 'list(corpus_lda)[i] = \t', list(corpus_lda)[i]
    print 'corpus_lda[i] = \t',corpus_lda[i]
Why does the content of list(corpus_lda) differ from the content of corpus_lda?
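A likely explanation, offered as a sketch rather than a confirmed diagnosis: `lda[corpus_tfidf]` does not return a stored result but a lazy wrapper (gensim's `TransformedCorpus`) that re-runs LDA inference every time it is iterated or indexed, and that inference starts from a random initialisation. Two passes over the same document can therefore give slightly different probabilities. The gensim-free toy below mimics that behaviour; the class `LazyTransformed` and the function `noisy_inference` are hypothetical stand-ins, not gensim APIs.

```python
import random


def noisy_inference(doc, rng):
    """Hypothetical stand-in for LDA inference: a 2-topic split with random jitter."""
    base = min(0.9, max(0.1, len(doc) / 10.0))
    p0 = base + rng.uniform(-0.05, 0.05)
    return [(0, p0), (1, 1.0 - p0)]


class LazyTransformed(object):
    """Toy analogue of a lazily transformed corpus: nothing is cached."""

    def __init__(self, corpus, seed=0):
        self.corpus = corpus
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.corpus)

    def __iter__(self):
        # Every pass recomputes every document from scratch.
        for doc in self.corpus:
            yield noisy_inference(doc, self.rng)

    def __getitem__(self, i):
        # Indexing also recomputes, so it need not match an earlier pass.
        return noisy_inference(self.corpus[i], self.rng)


corpus = [["a", "b"], ["a", "b", "c", "d", "e"], ["a"] * 8]
lazy = LazyTransformed(corpus)

first_pass = list(lazy)    # one full inference pass
second_pass = list(lazy)   # a second, independent pass: values differ
frozen = list(first_pass)  # materialised once: plain data, no recomputation
```

If this is indeed the cause, calling `list(corpus_lda)` once, keeping the result in a variable, and indexing only that list should give stable values within a run.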
2. Full code:
# encoding: utf-8
import datetime
st = datetime.datetime.now()
import jieba, os
from gensim import corpora, models, similarities
with open('../Data/all_stopword.txt', 'rb') as f:
    stopWords_set = {line.strip().decode('utf-8') for line in f}
print 'len of stop = ', len(stopWords_set)
# path = '../Data/Reduced/C000008/'
path = '../Data/wx_test/'
walkList = list(os.walk(path)) # must be run once!
# print 'len of len(walkList) = ', len(walkList)
rootPath = (walkList)[0]
root = rootPath[0]
dirs = rootPath[1]
files = rootPath[2]
print 'root = ', root
print 'dirs = ', dirs
print 'len(files) = ', len(files)
print type(root), type(dirs), type(files)
train_set = []
for name in files:
    f = open(os.path.join(root, name), 'r')
    #print f.name
    raw = f.read()
    f.close()
    word_list = list(jieba.cut(raw, cut_all = False))
    # Debug pass: report files that still contain the token 'nbsp'
    # (the words list built here is overwritten just below).
    words = []
    for word in word_list:
        if word not in stopWords_set and word != u' ':
            if word == u'nbsp':
                print os.path.join(root, name)
            words.append(word)
    # Final token list: drop stop words, spaces, and 'nbsp'.
    words = [word for word in word_list if word not in stopWords_set and word != ' ' and word != u'nbsp']
    train_set.append(words)
# word and its id
dic = corpora.Dictionary(train_set)
dic.save('./deerwester.dict') # store the dictionary, for future reference
print 'dic = ', dic
#print(dic.token2id)
corpus = [dic.doc2bow(text) for text in train_set]
# print 'corpus = ', corpus
tfidf = models.TfidfModel(corpus)
print 'tfidf = ', tfidf
corpus_tfidf = tfidf[corpus]
print 'corpus_tfidf = ', corpus_tfidf
num_topics = 2  # renamed from `sum` to avoid shadowing the builtin
lda = models.LdaModel(corpus_tfidf, id2word = dic, num_topics = num_topics)
for i in lda.print_topics(num_topics):
    print i[0], i[1].encode('utf-8')
corpus_lda = lda[corpus_tfidf]
print 'type(corpus_lda) = ', type(corpus_lda)
lcorpus_tfidf = list(corpus_lda)
for i in range(len(corpus_lda)):
    print 'list(corpus_lda)[i] = \t', list(corpus_lda)[i]
    print 'corpus_lda[i] = \t',corpus_lda[i]
    if (corpus_lda[i][0][1]<0.5) != (list(corpus_lda)[i][0][1]<0.5):
        print 'ERROR!!!'
et = datetime.datetime.now()
print 'run time = ', (et - st).seconds
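One further detail in the check above: `corpus_lda[i][0][1]` assumes the first tuple always belongs to topic 0. As far as I know, gensim omits topics whose probability falls below a minimum threshold (0.01 by default), so the first entry can be topic 1 when topic 0 is negligible. Below is a minimal sketch of a lookup by topic id, applied to a list that was computed once; the helper name `topic_prob` and the sample data are made up for illustration.

```python
def topic_prob(doc_topics, topic_id, default=0.0):
    """Return the probability of topic_id from a [(topic_id, prob), ...] list.

    Topics missing from the list (filtered out as too small) count as `default`.
    """
    return dict(doc_topics).get(topic_id, default)


# Hypothetical output of list(corpus_lda), computed once and reused.
doc_topics_list = [
    [(0, 0.83), (1, 0.17)],
    [(1, 0.995)],            # topic 0 fell below the threshold and was dropped
    [(0, 0.42), (1, 0.58)],
]

# Label each document by whether topic 0 dominates.
labels = [topic_prob(doc, 0) >= 0.5 for doc in doc_topics_list]
```

For the second document, the positional check `doc[0][1] >= 0.5` would report topic 0 as dominant even though the entry actually belongs to topic 1; the lookup by id avoids that.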