Python NLTK tagging AssertionError
I'm running into an odd assertion error when using NLTK to process around 5000 posts with the PlaintextCorpusReader. With some of our datasets we don't have any major issues. However, on rare occasions I'm met with:
File "/home/cp-staging/environs/cpstaging/lib/python2.5/site-packages/nltk/tag/api.py", line 51, in batch_tag
return [self.tag(sent) for sent in sentences]
File "nltk/corpus/reader/util.py", line 401, in iterate_from
File "nltk/corpus/reader/util.py", line 343, in iterate_from
AssertionError
My code works (basically) like so:
import nltk
from nltk.corpus import brown
from nltk.corpus.reader import PlaintextCorpusReader

# Train a bigram tagger on the Brown corpus, backing off to a
# unigram tagger and then to our custom ArcBaseTagger.
brown_tagged_sents = brown.tagged_sents()
tag0 = ArcBaseTagger('NN')
tag1 = nltk.UnigramTagger(brown_tagged_sents, backoff=tag0)
posts = PlaintextCorpusReader(posts_path, '.*')
tagger = nltk.BigramTagger(brown_tagged_sents, backoff=tag1)
tagged_sents = tagger.batch_tag(posts.sents())
It seems like nltk is losing its place in the file buffer, but I'm not 100% sure about that. Any idea what might cause this to happen? It almost seems like it has to be related to the data I'm processing. Maybe some funky characters?
Comments (2)
I also faced this problem when a write function was leaving my corpus files empty. Making sure the files being read are not empty avoids this error.
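For what it's worth, here is a minimal sketch of that check, assuming the posts sit in a flat directory and reusing posts_path from the question (the fileids list is my own helper):

import os
from nltk.corpus.reader import PlaintextCorpusReader

# Build the reader only over non-empty files; zero-byte files are
# what triggered the AssertionError in my case.
fileids = [f for f in os.listdir(posts_path)
           if os.path.getsize(os.path.join(posts_path, f)) > 0]
posts = PlaintextCorpusReader(posts_path, fileids)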
Removed some empty files from the set being parsed, problem solved.