NLTK language model (ngram): computing the probability of a word from its context
I am using Python and NLTK to build a language model as follows:
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)
# Thanks to miku, I fixed this problem
print lm.prob("word", ["This is a context which generates a word"])
>> 0.00493261081006
# But I got another problem like this one...
print lm.prob("b", ["This is a context which generates a word"])
But it doesn't seem to work. The result is as follows:
>>> print lm.prob("word", "This is a context which generates a word")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
return self._alpha(context) * self._backoff.prob(word, context[1:])
File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
return self._alpha(context) * self._backoff.prob(word, context[1:])
File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 82, in prob
"context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting
Can anyone help me out? Thanks!
4 Answers
I know this question is old but it pops up every time I google nltk's NgramModel class. NgramModel's prob implementation is a little unintuitive. The asker is confused. As far as I can tell, the answers aren't great. Since I don't use NgramModel often, this means I get confused. No more.
The source code lives here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method:
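The method body didn't survive on this page; reconstructed from the NLTK 2.x source linked above, it looks roughly like this (a sketch from memory — the exact error string and line numbers vary by version):

```python
# Reconstructed sketch of NgramModel.prob from NLTK 2.x (may differ by version)
class NgramModel:
    def prob(self, word, context):
        """Probability of `word` given `context`, with Katz back-off."""
        context = tuple(context)                # context must be an iterable of unigrams
        if (context + (word,)) in self._ngrams:
            return self[context].prob(word)     # i.e. self._model[context].prob(word)
        elif self._n > 1:
            # back off to the (n-1)-gram model, dropping the oldest context word
            return self._alpha(context) * self._backoff.prob(word, context[1:])
        else:
            raise RuntimeError("No probability mass assigned to word %s in "
                               "context %s" % (word, ' '.join(context)))
```

The two things to notice: the context is immediately cast to a tuple, and the lookup key is that tuple of unigrams plus the word.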
(note: 'self[context].prob(word)' is equivalent to 'self._model[context].prob(word)')
Okay. Now at least we know what to look for. What does context need to be? Let's look at an excerpt from the constructor:
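The excerpt was also lost on this page; reconstructed from the linked source, the relevant part of the constructor looks roughly like this (a sketch — padding, imports, and some bookkeeping omitted, and details vary by version):

```python
# Reconstructed sketch of the relevant part of NgramModel.__init__ (NLTK 2.x)
class NgramModel:
    def __init__(self, n, train, estimator):
        self._n = n
        self._ngrams = set()
        cfd = ConditionalFreqDist()          # from nltk.probability
        for ngram in ingrams(train, n):      # nltk.util.ingrams in NLTK 2.x
            self._ngrams.add(ngram)
            context = tuple(ngram[:-1])      # the "context" key: an (n-1)-tuple of unigrams
            token = ngram[-1]
            cfd[context].inc(token)
        self._model = ConditionalProbDist(cfd, estimator, len(cfd))
        if n > 1:
            # recursively build the back-off (n-1)-gram model used by prob
            self._backoff = NgramModel(n - 1, train, estimator)
```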
Alright. The constructor creates a conditional probability distribution (self._model) out of a conditional frequency distribution whose "context" is tuples of unigrams. This tells us 'context' should not be a string or a list with a single multi-word string. 'context' MUST be something iterable containing unigrams. In fact, the requirement is a little more strict. These tuples or lists must be of size n-1. Think of it this way. You told it to be a trigram model. You better give it the appropriate context for trigrams.
Let's see this in action with a simpler example:
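The original snippet wasn't captured on this page, but the point can be shown without NLTK at all (illustrative Python 3 toy code, not the library's API):

```python
from collections import defaultdict

# A toy trigram count table, keyed the same way NgramModel's conditional
# frequency distribution is keyed: by (n-1)-tuples of unigrams.
tokens = "this is a context which generates a word".split()
n = 3
counts = defaultdict(lambda: defaultdict(int))
for i in range(len(tokens) - n + 1):
    *context, token = tokens[i:i + n]
    counts[tuple(context)][token] += 1

print(counts[("generates", "a")]["word"])   # 1  -- context as an (n-1)-tuple: found
print(counts[("generates a",)]["word"])     # 0  -- one multi-word string: a different key
```

Hand the model anything other than an (n-1)-tuple of unigrams and you're looking up a key that was never stored.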
(As a side note, actually trying to do anything with MLE as your estimator in NgramModel is a bad idea. Things will fall apart. I guarantee it.)
As for the original question, my best guess at what the OP wants is something like lm.prob("word", "generates a".split()) — the last two unigrams of his sentence, since a trigram model wants a two-word context — but there are so many misunderstandings going on here that I can't possibly tell what he was actually trying to do.
Quick fix:
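The answer's snippet wasn't captured on this page; the usual fix is to pass the context as a sequence of unigrams rather than one multi-word string — e.g. split it first (Python 3 shown; lm is the model built in the question):

```python
# Tokenize the context into unigrams; lm.prob expects a sequence of words,
# not a single multi-word string.
context = "This is a context which generates a word".split()
print(context[:3])   # ['This', 'is', 'a']
# then: lm.prob("word", context)
```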
As regards your second question: this happens because "b" doesn't occur in the Brown corpus category 'news', as you can verify with 'b' in brown.words(categories='news'), whereas "word" does occur there. I admit the error message is very cryptic, so you might want to file a bug report with the NLTK authors.
I would stay away from NLTK's NgramModel for the time being. There is currently a smoothing bug that causes the model to greatly overestimate likelihoods when n>1. If you do end up using NgramModel, you should definitely apply the fix mentioned in the git issue tracker here: https://github.com/nltk/nltk/issues/367