Calculating the PMI of a bigram and a discrepancy in the result

Posted 2025-01-21 05:34:49

Suppose I have the following text:

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

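I can calculate the PMI of each bigram using NLTK as follows:

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
# score_ngrams yields (bigram, score) tuples, highest score first
for scored_bigram in finder.score_ngrams(bigram_measures.pmi):
    print(scored_bigram)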

which gives:

(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sentence'), 2.523561956057013)
(('black', 'sheep'), 2.523561956057013)
(('sheep', 'foo'), 2.353636954614701)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)

Now, to check my own understanding, I want to compute PMI('black', 'sheep') by hand. The PMI formula is given as:

$$ pmi(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} $$

The text contains 4 instances of 'black' and 3 instances of 'sheep', 'black' and 'sheep' occur together 3 times, and the text is 23 tokens long. Following the formula, I compute:

np.log((3/23)/((4/23)*(3/23)))

That gives 1.749199854809259 rather than 2.523561956057013. Why is there a discrepancy? What am I missing?

Comments (1)

无声静候 2025-01-28 05:34:49

Your PMI formula uses a base-2 logarithm, not base e.

According to NumPy's documentation, numpy.log is the natural logarithm (base e), which is not what you want here.

The following expression gives 2.523561956057013:

math.log((3/23)/((4/23)*(3/23)), 2)
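
To make the comparison concrete, here is a minimal sketch that recomputes the same PMI from raw counts without NLTK. The helper name `pmi` and the whitespace tokenization are my own assumptions for illustration (for this particular text, `str.split()` should produce the same 23 tokens as `word_tokenize`); it is not how NLTK implements the measure internally.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep  foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")

# Assumption: plain whitespace splitting stands in for word_tokenize here.
tokens = text.split()
n = len(tokens)  # total number of tokens

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs


def pmi(w1, w2, base=2):
    """PMI from raw counts, with every count divided by the token total n."""
    p_joint = bigrams[(w1, w2)] / n
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log(p_joint / (p_w1 * p_w2), base)


pmi_base2 = pmi("black", "sheep")              # base 2, the convention in the formula
pmi_natural = pmi("black", "sheep", math.e)    # base e, what np.log computes
print(pmi_base2, pmi_natural)
```

The base-2 value should agree with the 2.5236 that NLTK reports for ('black', 'sheep'), and the base-e value with the 1.7492 from the question, so the only difference between the two numbers is the logarithm base.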