How do I measure coherence in an sklearn LDA model (not a gensim LDA model)?

Posted on 2025-01-26 09:34:41


I have tried using two techniques, but I am getting different results. I just want to be sure about which one to go with.

Method 1:
I tried using

import numpy as np
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

# vocab must follow the DTM's column order; vocabulary_.keys() is in
# insertion order, so use the index-aligned feature names instead
metric_coherence_gensim(measure='u_mass',
                        top_n=25,
                        topic_word_distrib=lda.components_,
                        dtm=dtm,
                        vocab=np.array(tfidf_vect.get_feature_names_out()),
                        return_mean=True)

The source mentioned that a decent coherence score should be between -14 and +14. Any explanation of this range would also help.

Method 2:
I wrote the functions below to calculate the score without any built-in library.

import math

def get_umass_score(dt_matrix, i, j):
    # Binarize the document-term matrix: 1 if the word occurs in the document
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    Di = col_i.sum()                     # documents containing word i
    Dij = ((col_i + col_j) == 2).sum()   # documents containing both words
    # UMass pairwise score: log((D(w_i, w_j) + 1) / D(w_i))
    return math.log((Dij + 1) / Di)

def get_topic_coherence(dt_matrix, topic, n_top_words):
    # Pair each word's weight with its column index, keep the top n words
    indexed_topic = zip(topic, range(len(topic)))
    topic_top = sorted(indexed_topic, key=lambda x: x[0], reverse=True)[:n_top_words]
    coherence = 0
    # Sum the pairwise score over all pairs i < j of top words
    # (range(j_index), not range(j_index - 1), which would skip pairs)
    for j_index in range(len(topic_top)):
        for i_index in range(j_index):
            i = topic_top[i_index][1]
            j = topic_top[j_index][1]
            coherence += get_umass_score(dt_matrix, i, j)
    return coherence


def get_average_topic_coherence(dt_matrix, topics, n_top_words):
    # Mean of the per-topic coherence scores
    total_coherence = 0
    for topic in topics:
        total_coherence += get_topic_coherence(dt_matrix, topic, n_top_words)
    return total_coherence / len(topics)
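As a sanity check on the pairwise score, here is a tiny hand-worked example (toy matrix, made up for illustration; the function is restated so the snippet runs standalone): word 0 appears in three documents, word 1 in two, and both co-occur in two documents, so the score is log((2 + 1) / 3) = 0.

```python
import math
import numpy as np

def get_umass_score(dt_matrix, i, j):
    # Same formula as above: log((D(w_i, w_j) + 1) / D(w_i))
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    Di = col_i.sum()
    Dij = ((col_i + col_j) == 2).sum()
    return math.log((Dij + 1) / Di)

# 4 documents x 2 words: word 0 occurs in docs 0-2, word 1 in docs 0-1
dtm = np.array([[3, 1],
                [1, 2],
                [5, 0],
                [0, 0]])
print(get_umass_score(dtm, 0, 1))  # log((2 + 1) / 3) = 0.0
```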

I got this from a StackOverflow post (credit to its author), but the magnitude of the score grows sharply depending on the value I pass for n_top_words.
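That growth is expected rather than a bug: the sum runs over all n(n-1)/2 pairs of top words, and each UMass term is typically negative, so the total scales roughly quadratically with n_top_words. A small sketch of the pair count (dividing the summed score by this count is one common way to make runs with different n_top_words comparable):

```python
# Number of (i, j) pairs summed by the coherence loop for n top words
def n_pairs(n_top_words):
    return n_top_words * (n_top_words - 1) // 2

for n in (5, 10, 25):
    print(n, n_pairs(n))  # 10, 45, 300 pairs respectively
```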

Can someone tell me which method is reliable, or is there a better way to find the coherence score for sklearn LDA models?
