Analyzing the most common n-grams

Posted on 2025-02-04 06:26:07


Good day,

I have been studying NLP and came across this code for top n-gram extraction:

from sklearn.feature_extraction.text import CountVectorizer

def get_top_tweet_bigrams(corpus, n=None):
    # Learn the bigram vocabulary from the corpus.
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    # Document-term matrix: one row per document, one column per bigram.
    bag_of_words = vec.transform(corpus)
    # Sum each column: total frequency of each bigram across all documents.
    sum_words = bag_of_words.sum(axis=0)
    # Pair each bigram with its total frequency via its column index.
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

I have gone over this function line by line, and the part I cannot figure out is this one:

[(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]

I do understand what it achieves, but what I don't understand is how. Why would simply extracting idx from vec.vocabulary_.items() give us an incorrect count? And what does the matrix sum_words hold? What are those values? Thank you.


Comments (1)

虐人心 2025-02-11 06:26:07

  • bag_of_words is the usual 2-dimensional document-ngram frequency matrix, i.e. it contains the frequency of every ngram in every document (corpus may contain any number of documents).
  • sum_words holds the sum of each ngram's frequency across all documents. Because .sum(axis=0) on a sparse matrix returns a 1 x V matrix (V being the vocabulary size), it is indexed as sum_words[0, idx]. Its columns are in the same order as in bag_of_words, and it doesn't contain the ngrams themselves, of course. (See the sketch after this list.)
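A minimal sketch of what these two objects hold, using a hypothetical three-document corpus (not from the question):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the cat ran", "the dog sat"]
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)

# Columns are sorted alphabetically:
# 0='cat ran', 1='cat sat', 2='dog sat', 3='the cat', 4='the dog'
bag_of_words = vec.transform(corpus)
print(bag_of_words.toarray())
# [[0 1 0 1 0]    one row per document, one column per bigram
#  [1 0 0 1 0]
#  [0 0 1 0 1]]

sum_words = bag_of_words.sum(axis=0)   # column sums -> a 1 x 5 matrix
print(sum_words)                       # [[1 1 1 2 1]]
print(sum_words[0, 3])                 # 2 -> total count of 'the cat'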

Since the goal is to obtain the top frequent ngrams, we need to match every ngram to its frequency in sum_words. This is why the vocabulary (which maps each ngram to its column index) is iterated over with both the ngram and the index: if only the index idx were available, there would be no way to know which actual ngram it represents, and without the index there would be no way to look up the total frequency in sum_words. Thus words_freq is a list of (ngram, frequency) pairs, one per ngram, as shown below.
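Continuing the toy example above, vec.vocabulary_ is the dict mapping each bigram string to its column index, and the comprehension uses that index to look up the total count:

print(vec.vocabulary_)
# {'the cat': 3, 'cat sat': 1, 'cat ran': 0, 'the dog': 4, 'dog sat': 2}

words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
print(words_freq)
# pairs like ('the cat', 2), ('cat sat', 1), ... (dict iteration order, not sorted yet)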

The last 2 lines sort this array by decreasing frequency and extract the top n elements.
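Putting it together on the same toy corpus (a sketch, assuming the function defined in the question):

print(get_top_tweet_bigrams(corpus, n=3))
# e.g. [('the cat', 2), ('cat sat', 1), ('cat ran', 1)]
# ties among the count-1 bigrams fall in arbitrary order

Note that the default n=None simply slices as words_freq[:None], which returns every bigram.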
