Analyzing the most common n-grams

Posted on 2025-02-04 06:26:07


Good day,

I have been studying NLP and came across this code for top n-gram extraction:

from sklearn.feature_extraction.text import CountVectorizer

def get_top_tweet_bigrams(corpus, n=None):
    # Learn the bigram vocabulary from the corpus.
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    # Document-term matrix: one row per document, one column per bigram.
    bag_of_words = vec.transform(corpus)
    # Sum each column: total frequency of each bigram across all documents.
    sum_words = bag_of_words.sum(axis=0)
    # Pair each bigram with its total frequency via its column index.
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

I have gone over this function line by line, and the part I cannot figure out is this one:

[(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]

I do understand what it achieves, but what I don't understand is how. Why would simply extracting idx from vec.vocabulary_.items() give us an incorrect count? And what does the matrix sum_words hold? What are those values? Thank you.


Comments (1)

虐人心 2025-02-11 06:26:07

  • bag_of_words is the usual 2-dimensional document-ngram frequency matrix, i.e. it contains the frequency of every ngram in every document (corpus may contain any number of documents).
  • sum_words holds the sum of each ngram's frequency across all documents. Because .sum(axis=0) on a sparse matrix returns a 1 x V matrix (V being the vocabulary size), it is indexed as sum_words[0, idx]. Its columns are in the same order as in bag_of_words, and it doesn't contain the ngrams themselves, of course. (See the sketch after this list.)
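A minimal sketch of what these two objects hold, using a hypothetical three-document corpus (not from the question):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the cat ran", "the dog sat"]
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)

# Columns are sorted alphabetically:
# 0='cat ran', 1='cat sat', 2='dog sat', 3='the cat', 4='the dog'
bag_of_words = vec.transform(corpus)
print(bag_of_words.toarray())
# [[0 1 0 1 0]    one row per document, one column per bigram
#  [1 0 0 1 0]
#  [0 0 1 0 1]]

sum_words = bag_of_words.sum(axis=0)   # column sums -> a 1 x 5 matrix
print(sum_words)                       # [[1 1 1 2 1]]
print(sum_words[0, 3])                 # 2 -> total count of 'the cat'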

Since the goal is to obtain the top frequent ngrams, we need to match every ngram to its frequency in sum_words. This is why the vocabulary (which maps each ngram to its column index) is iterated over with both the ngram and the index: if only the index idx were available, there would be no way to know which actual ngram it represents, and without the index there would be no way to look up the total frequency in sum_words. Thus words_freq is a list of (ngram, frequency) pairs, one per ngram, as shown below.
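Continuing the toy example above, vec.vocabulary_ is the dict mapping each bigram string to its column index, and the comprehension uses that index to look up the total count:

print(vec.vocabulary_)
# {'the cat': 3, 'cat sat': 1, 'cat ran': 0, 'the dog': 4, 'dog sat': 2}

words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
print(words_freq)
# pairs like ('the cat', 2), ('cat sat', 1), ... (dict iteration order, not sorted yet)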

The last 2 lines sort this array by decreasing frequency and extract the top n elements.
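Putting it together on the same toy corpus (a sketch, assuming the function defined in the question):

print(get_top_tweet_bigrams(corpus, n=3))
# e.g. [('the cat', 2), ('cat sat', 1), ('cat ran', 1)]
# ties among the count-1 bigrams fall in arbitrary order

Note that the default n=None simply slices as words_freq[:None], which returns every bigram.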
