Python NLTK tagging AssertionError
I'm running into an odd assertion error when using NLTK to process around 5000 posts with the PlaintextCorpusReader. With some of our datasets we don't have any major issues. However, on rare occasions I'm met with:
File "/home/cp-staging/environs/cpstaging/lib/python2.5/site-packages/nltk/tag/api.py", line 51, in batch_tag
return [self.tag(sent) for sent in sentences]
File "nltk/corpus/reader/util.py", line 401, in iterate_from
File "nltk/corpus/reader/util.py", line 343, in iterate_from
AssertionError
My code works (basically) like so:
import nltk
from nltk.corpus import brown
from nltk.corpus.reader import PlaintextCorpusReader

# Train a bigram tagger on the Brown corpus, backing off to a
# unigram tagger and then to our custom ArcBaseTagger.
brown_tagged_sents = brown.tagged_sents()
tag0 = ArcBaseTagger('NN')
tag1 = nltk.UnigramTagger(brown_tagged_sents, backoff=tag0)
posts = PlaintextCorpusReader(posts_path, '.*')
tagger = nltk.BigramTagger(brown_tagged_sents, backoff=tag1)
tagged_sents = tagger.batch_tag(posts.sents())
It seems like nltk is losing its place in the file buffer, but I'm not 100% sure about that. Any idea what might cause this to happen? It almost seems like it has to be related to the data I'm processing. Maybe some funky characters?
Comments (2)
I also faced this problem when a write function was leaving my corpus files empty. Making sure the files being read are not empty avoids this error.
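For what it's worth, here is a minimal sketch of that check, assuming the posts sit in a flat directory and reusing posts_path from the question (the fileids list is my own helper):

import os
from nltk.corpus.reader import PlaintextCorpusReader

# Build the reader only over non-empty files; zero-byte files are
# what triggered the AssertionError in my case.
fileids = [f for f in os.listdir(posts_path)
           if os.path.getsize(os.path.join(posts_path, f)) > 0]
posts = PlaintextCorpusReader(posts_path, fileids)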
Removed some empty files from the set being parsed, problem solved.