How do I measure coherence in a sklearn LDA model (not a Gensim LDA model)?
I have tried using two techniques, but I am getting different results. I just want to be sure about which one to go with.
Method 1:
I tried using
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

metric_coherence_gensim(measure='u_mass',
                        top_n=25,
                        topic_word_distrib=lda.components_,
                        dtm=dtm,
                        # vocab must be ordered by dtm column index;
                        # get_feature_names_out() returns it in that order,
                        # while vocabulary_.keys() does not guarantee it
                        vocab=np.array(tfidf_vect.get_feature_names_out()),
                        return_mean=True)
The source mentioned that a decent coherence score should be between -14 and +14. Any explanation of this range would also help.
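One thing worth noting about the call above: sklearn vectorizers store `vocabulary_` as a term-to-column-index dict, so iterating over `.keys()` is not guaranteed to match the column order of the document-term matrix. A minimal sketch of the difference, using a hand-built dict in place of a real fitted vectorizer (the dict contents here are hypothetical):

```python
import numpy as np

# Hypothetical vocabulary_ as a fitted sklearn vectorizer would store it:
# term -> column index in the dtm. Insertion order deliberately differs
# from column-index order.
vocabulary_ = {'topic': 2, 'model': 0, 'score': 1}

# Not column order: dict iteration follows insertion order
vocab_wrong = np.array(list(vocabulary_.keys()))

# Column order: sort terms by their index so vocab[i] labels dtm column i
vocab = np.array(sorted(vocabulary_, key=vocabulary_.get))

print(vocab_wrong)  # ['topic' 'model' 'score']
print(vocab)        # ['model' 'score' 'topic']
```

With a real fitted vectorizer, `get_feature_names_out()` returns the same column-ordered array directly, so a mismatched vocab is one plausible source of strange coherence values from Method 1.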
Method 2:
I had to write the functions to calculate the score myself, without any built-in library.
import math
import numpy as np

def get_umass_score(dt_matrix, i, j):
    # Binarize: 1 if the term occurs in the document, 0 otherwise
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    col_ij = col_i + col_j
    col_ij = (col_ij == 2).astype(int)  # documents containing both terms
    Di, Dij = col_i.sum(), col_ij.sum()
    return math.log((Dij + 1) / Di)

def get_topic_coherence(dt_matrix, topic, n_top_words):
    # Pair each topic-word weight with its vocabulary index,
    # then keep the n_top_words highest-weighted terms
    indexed_topic = zip(topic, range(len(topic)))
    topic_top = sorted(indexed_topic, key=lambda x: -x[0])[:n_top_words]
    coherence = 0
    for j_index in range(len(topic_top)):
        for i_index in range(j_index):  # all pairs with i_index < j_index
            i = topic_top[i_index][1]
            j = topic_top[j_index][1]
            coherence += get_umass_score(dt_matrix, i, j)
    return coherence

def get_average_topic_coherence(dt_matrix, topics, n_top_words):
    total_coherence = 0
    for topic in topics:
        total_coherence += get_topic_coherence(dt_matrix, topic, n_top_words)
    return total_coherence / len(topics)
I got this from a StackOverflow post (credit to its author), but I was getting huge values depending on the value I pass for n_top_words. (Note: the original inner loop was `range(0, j_index - 1)`, which skips adjacent pairs; I have written it as `range(j_index)` above so that every i < j pair is counted.)
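For reference, here is a quick sanity check of the same UMass formula on a tiny hand-made document-term matrix (a self-contained sketch; the pairwise loop covers every i < j pair among the chosen top terms):

```python
import math
import numpy as np

def umass_pair(zo_matrix, i, j):
    # log of (co-document frequency + 1) / document frequency of term i,
    # matching the formula in get_umass_score above
    col_i = zo_matrix[:, i]
    col_j = zo_matrix[:, j]
    Dij = int(np.sum((col_i + col_j) == 2))
    Di = int(col_i.sum())
    return math.log((Dij + 1) / Di)

# 4 documents x 3 terms, already binarized (term present / absent)
dtm = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
])

# Sum over all pairs i < j among the "top" terms (here: all 3 columns)
top = [0, 1, 2]
score = sum(umass_pair(dtm, top[a], top[b])
            for b in range(len(top))
            for a in range(b))
print(round(score, 4))  # -0.4055 (only the (0, 2) pair is nonzero here)
```

Since the score is a raw sum over pairs, its magnitude grows roughly quadratically with n_top_words, which may explain the "huge values" I am seeing; dividing by the number of pairs (as gensim's mean aggregation effectively does) would keep scores comparable across different n_top_words.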
Can someone tell me which method is reliable, or is there any better way I can find the coherence score for sklearn LDA models?