Analyzing the most common n-grams
Good day,
I have been studying NLP and came across this code for top n-gram extraction:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_tweet_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
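For example, calling it on a small made-up corpus (the three strings below are just placeholders) returns the most frequent bigrams with their counts:

corpus = [
    "good day to you",
    "good day sunshine",
    "have a good day",
]
print(get_top_tweet_bigrams(corpus, n=3))
# 'good day' (count 3) comes first, followed by two of the bigrams that appear once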
I have gone over this function line by line, and the part I cannot figure out is this one:
[(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
I do understand what it achieves, but what I don't understand is how. Why would simply extracting idx from vec.vocabulary_.items() give us an incorrect count? And what does the matrix sum_words hold? What are those values? Thank you.
1 Answer
bag_of_words is the usual two-dimensional document-ngram frequency matrix, i.e. it contains the frequency of every ngram in every document (corpus may contain any number of documents). sum_words is the sum of those frequencies across documents for every ngram: a one-dimensional array whose length is the vocabulary size, with the indexes in the same order as the columns of bag_of_words. It doesn't contain the ngrams themselves, of course.
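To make this concrete, here is a small sketch (the two-document corpus is made up purely for illustration) that prints these intermediate objects:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["good day to you", "good day sunshine"]  # toy corpus, for illustration only

vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
bag_of_words = vec.transform(corpus)

print(bag_of_words.toarray())
# One row per document, one column per bigram (columns are in alphabetical order):
# [[0 1 1 1]    <- "good day to you"
#  [1 0 1 0]]   <- "good day sunshine"

sum_words = bag_of_words.sum(axis=0)
print(sum_words)
# [[1 1 2 1]] -- the total count of each bigram across both documents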
Since the goal is to obtain the top frequent ngrams, we need to match every ngram to its frequency in sum_words. This is why the vocabulary (which contains the ngrams) is iterated over with both the ngram and its index: if only the index idx were obtained, there would be no way to know which actual ngram it represents, and of course the index is what is used to look up the total frequency in sum_words. Thus words_freq is an array containing a pair (ngram, frequency) for every ngram. The last two lines sort this array by decreasing frequency and extract the top n elements.
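Continuing the same toy example, the list comprehension simply pairs each bigram from the vocabulary with the count stored at its column index, and the remaining lines sort and truncate the result:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["good day to you", "good day sunshine"]  # same toy corpus as above
vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
sum_words = vec.transform(corpus).sum(axis=0)

print(vec.vocabulary_)
# Maps each bigram to its column index, e.g.
# {'good day': 2, 'day to': 1, 'to you': 3, 'day sunshine': 0}

# Pair every bigram with the total count found at its column index ...
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]

# ... then sort by count, highest first
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq)
# 'good day' (count 2) comes first, followed by the three bigrams that occur once each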