How do I measure coherence in an sklearn LDA model (not a gensim LDA model)?

Posted on 2025-01-26 09:34:41


I have tried using two techniques, but I am getting different results. I just want to be sure about which one to go with.

Method 1:
I tried using

import numpy as np
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

# vocab must follow the DTM's column order; vocabulary_.keys() is in
# insertion order, so use the index-aligned feature names instead
metric_coherence_gensim(measure='u_mass',
                        top_n=25,
                        topic_word_distrib=lda.components_,
                        dtm=dtm,
                        vocab=np.array(tfidf_vect.get_feature_names_out()),
                        return_mean=True)

The source mentioned that a decent coherence score should be between -14 and +14. Any explanation of this range would also help.

Method 2:
I wrote the functions below to calculate the score without any built-in library.

import math

def get_umass_score(dt_matrix, i, j):
    # Binarize the document-term matrix: 1 if the word occurs in the document
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    Di = col_i.sum()                     # documents containing word i
    Dij = ((col_i + col_j) == 2).sum()   # documents containing both words
    # UMass pairwise score: log((D(w_i, w_j) + 1) / D(w_i))
    return math.log((Dij + 1) / Di)

def get_topic_coherence(dt_matrix, topic, n_top_words):
    # Pair each word's weight with its column index, keep the top n words
    indexed_topic = zip(topic, range(len(topic)))
    topic_top = sorted(indexed_topic, key=lambda x: x[0], reverse=True)[:n_top_words]
    coherence = 0
    # Sum the pairwise score over all pairs i < j of top words
    # (range(j_index), not range(j_index - 1), which would skip pairs)
    for j_index in range(len(topic_top)):
        for i_index in range(j_index):
            i = topic_top[i_index][1]
            j = topic_top[j_index][1]
            coherence += get_umass_score(dt_matrix, i, j)
    return coherence


def get_average_topic_coherence(dt_matrix, topics, n_top_words):
    # Mean of the per-topic coherence scores
    total_coherence = 0
    for topic in topics:
        total_coherence += get_topic_coherence(dt_matrix, topic, n_top_words)
    return total_coherence / len(topics)
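As a sanity check on the pairwise score, here is a tiny hand-worked example (toy matrix, made up for illustration; the function is restated so the snippet runs standalone): word 0 appears in three documents, word 1 in two, and both co-occur in two documents, so the score is log((2 + 1) / 3) = 0.

```python
import math
import numpy as np

def get_umass_score(dt_matrix, i, j):
    # Same formula as above: log((D(w_i, w_j) + 1) / D(w_i))
    zo_matrix = (dt_matrix > 0).astype(int)
    col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
    Di = col_i.sum()
    Dij = ((col_i + col_j) == 2).sum()
    return math.log((Dij + 1) / Di)

# 4 documents x 2 words: word 0 occurs in docs 0-2, word 1 in docs 0-1
dtm = np.array([[3, 1],
                [1, 2],
                [5, 0],
                [0, 0]])
print(get_umass_score(dtm, 0, 1))  # log((2 + 1) / 3) = 0.0
```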

I got this from a StackOverflow post (credit to its author), but the magnitude of the score grows sharply depending on the value I pass for n_top_words.
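That growth is expected rather than a bug: the sum runs over all n(n-1)/2 pairs of top words, and each UMass term is typically negative, so the total scales roughly quadratically with n_top_words. A small sketch of the pair count (dividing the summed score by this count is one common way to make runs with different n_top_words comparable):

```python
# Number of (i, j) pairs summed by the coherence loop for n top words
def n_pairs(n_top_words):
    return n_top_words * (n_top_words - 1) // 2

for n in (5, 10, 25):
    print(n, n_pairs(n))  # 10, 45, 300 pairs respectively
```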

Can someone tell me which method is reliable, or is there a better way to find the coherence score for sklearn LDA models?
