Calculating PMI for bigrams and a discrepancy

Posted on 2025-01-21 05:34:49

Suppose I have the following text:

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

I can calculate the PMI for the bigrams using NLTK as follows:

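import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(bigram_measures.pmi):
    print(i)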

which gives:

(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sentence'), 2.523561956057013)
(('black', 'sheep'), 2.523561956057013)
(('sheep', 'foo'), 2.353636954614701)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)

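Now, to check my own understanding, I want to compute PMI('black', 'sheep') by hand. The PMI formula is:

$pmi(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$

There are 4 instances of 'black' in the text and 3 instances of 'sheep'; 'black' and 'sheep' occur together 3 times, and the text is 23 tokens long. Following the formula, I compute: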

import numpy as np
np.log((3/23)/((4/23)*(3/23)))

This gives 1.749199854809259 instead of 2.523561956057013. Why is there a discrepancy? What am I missing here?
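A likely source of the gap is the logarithm base: NLTK's association measures work in base 2, while `np.log` is the natural logarithm. The sketch below recomputes the score from raw counts using only the standard library. The helper name `pmi_log2` is made up for illustration, and whitespace splitting stands in for `word_tokenize`, an assumption that holds here only because the text has no punctuation.

```python
import math
from collections import Counter

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

# Assumption: str.split() yields the same tokens as word_tokenize for this text.
tokens = text.split()
n = len(tokens)                              # total number of tokens
unigrams = Counter(tokens)                   # word -> count
bigrams = Counter(zip(tokens, tokens[1:]))   # (w1, w2) -> count of adjacent pairs


def pmi_log2(w1, w2):
    """Hypothetical helper: PMI of an adjacent pair using a base-2 logarithm."""
    p_joint = bigrams[(w1, w2)] / n
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_joint / (p_w1 * p_w2))


print(pmi_log2("black", "sheep"))  # base 2: approximately 2.5236
print(math.log(23 / 4))            # natural log of the same ratio: approximately 1.7492
```

With the same counts (4, 3, 3 and 23), switching the hand calculation from `np.log` to `np.log2` should reproduce the NLTK score, since the two results differ only by a factor of ln 2.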
