Calculating PMI for bigrams and a discrepancy
Suppose I have the following text:
text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
I can calculate the PMI for each bigram using NLTK as follows:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(bigram_measures.pmi):
    print(i)
which gives:
(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sentence'), 2.523561956057013)
(('black', 'sheep'), 2.523561956057013)
(('sheep', 'foo'), 2.353636954614701)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)
Now, to check my own understanding, I want to compute PMI('black', 'sheep') by hand. The PMI formula is given as:
PMI(x, y) = log( P(x, y) / (P(x) * P(y)) )
There are 4 instances of 'black' in the text and 3 instances of 'sheep', 'black' and 'sheep' occur together 3 times, and the length of the text is 23 tokens. Following the formula, I compute:
np.log((3/23)/((4/23)*(3/23)))
That gives 1.749199854809259 rather than 2.523561956057013. Why is there a discrepancy? What am I missing?
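The counts quoted in the question can be checked independently of NLTK. The following is a minimal sketch, assuming plain whitespace splitting, which should yield the same tokens as word_tokenize for this text because it contains no punctuation. The variable names are illustrative and not part of the original post.

```python
from collections import Counter

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")

# Assumption: whitespace splitting stands in for word_tokenize here.
tokens = text.split()

# Count single words and adjacent word pairs.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

n_tokens = len(tokens)
n_black = unigram_counts["black"]
n_sheep = unigram_counts["sheep"]
n_black_sheep = bigram_counts[("black", "sheep")]

print(n_tokens, n_black, n_sheep, n_black_sheep)  # 23 4 3 3
```

The counts agree with the ones used in the question, so the discrepancy does not come from the counting step.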
Comments (1)
NLTK's PMI uses a base-2 logarithm, whereas your calculation uses base e.
According to NumPy's documentation, numpy.log is the natural logarithm (base e), which is not what you want here.
Evaluating the same expression with a base-2 logarithm, for example numpy.log2, gives the result 2.523561956057013.
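To make the effect of the logarithm base concrete, here is a small sketch that evaluates the same ratio with both bases. It uses Python's standard math module rather than NumPy to stay self-contained, and the counts are the ones stated in the question. The variable names are illustrative.

```python
import math

# Counts taken from the question.
n_total = 23        # tokens in the text
n_black = 4         # occurrences of 'black'
n_sheep = 3         # occurrences of 'sheep'
n_black_sheep = 3   # occurrences of the bigram ('black', 'sheep')

# P(x, y) / (P(x) * P(y))
ratio = (n_black_sheep / n_total) / ((n_black / n_total) * (n_sheep / n_total))

pmi_base_e = math.log(ratio)    # natural log, what np.log computes
pmi_base_2 = math.log2(ratio)   # base-2 log, matching the NLTK score

print(pmi_base_e)  # approximately 1.7492
print(pmi_base_2)  # approximately 2.5236
```

The two values differ only by a constant factor: dividing the natural-log result by math.log(2) converts it to base 2.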