Python NLTK tagging AssertionError

Posted 2024-10-18 13:00:43 | Word count 887 | Views 4 | Comments 0

I'm running into an odd assertion error when using NLTK to process around 5000 posts with the PlaintextCorpusReader. Most of our datasets process without any major issues, but on rare occasions I get:

File "/home/cp-staging/environs/cpstaging/lib/python2.5/site-packages/nltk/tag/api.py", line 51, in batch_tag
return [self.tag(sent) for sent in sentences]
File "nltk/corpus/reader/util.py", line 401, in iterate_from
File "nltk/corpus/reader/util.py", line 343, in iterate_from
AssertionError

My code works (basically) like so:

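import nltk
from nltk.corpus import brown
from nltk.corpus.reader import PlaintextCorpusReader

# Backoff chain: bigram -> unigram -> default tagger.
# ArcBaseTagger is defined elsewhere (it is not an NLTK class).
brown_tagged_sents = brown.tagged_sents()
tag0 = ArcBaseTagger('NN')
tag1 = nltk.UnigramTagger(brown_tagged_sents, backoff=tag0)
# posts_path is the directory containing the post files.
posts = PlaintextCorpusReader(posts_path, '.*')
tagger = nltk.BigramTagger(brown_tagged_sents, backoff=tag1)
tagged_sents = tagger.batch_tag(posts.sents())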

It seems like NLTK is losing its place in the file buffer, but I'm not 100% sure of that. Any idea what might cause this? Since it only happens with certain datasets, it almost has to be related to the data I'm processing. Maybe some funky characters?
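
One way to narrow this down is to read the corpus one file at a time and record which files raise the error. The following is only a sketch: `find_failing_fileids` is a hypothetical helper name, and it assumes the reader exposes `fileids()` and `sents(fileid)` as current `PlaintextCorpusReader` versions do (older NLTK releases may name these differently).

```python
def find_failing_fileids(reader):
    """Return the file ids whose sentences cannot be read without an AssertionError.

    `reader` is assumed to be a corpus reader offering fileids() and
    sents(fileid), e.g. an nltk PlaintextCorpusReader.
    """
    failing = []
    for fileid in reader.fileids():
        try:
            # Consume the lazy sentence view so the file is actually read.
            for _sentence in reader.sents(fileid):
                pass
        except AssertionError:
            failing.append(fileid)
    return failing
```

Calling `find_failing_fileids(posts)` on the reader from the snippet above should list the posts worth inspecting by hand.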

Comments (2)

胡渣熟男 2024-10-25 13:00:43

I also hit this problem when one of my write functions was leaving my corpus files empty. Making sure the files being read are not empty avoids this error.

月依秋水 2024-10-25 13:00:43

Removed some empty files from the parsing, problem solved.
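
Both answers point at empty files, so one possible approach is to build the reader from an explicit list of non-empty files instead of the catch-all `'.*'` pattern. This is a minimal sketch, not a confirmed fix: `non_empty_fileids` and `build_reader` are hypothetical helper names, and only zero-byte files are skipped (files containing nothing but whitespace are not handled here).

```python
import os


def non_empty_fileids(root):
    """Return '/'-separated paths, relative to root, of files larger than 0 bytes."""
    fileids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if os.path.getsize(full_path) == 0:
                continue  # skip empty files, the suspected trigger
            relative = os.path.relpath(full_path, root)
            fileids.append(relative.replace(os.sep, "/"))
    return sorted(fileids)


def build_reader(root):
    """Create a PlaintextCorpusReader restricted to the non-empty files under root."""
    # Imported lazily so non_empty_fileids() can be used without NLTK installed.
    from nltk.corpus.reader import PlaintextCorpusReader

    fileids = non_empty_fileids(root)
    if not fileids:
        raise ValueError("no non-empty files found under %r" % root)
    # The reader accepts either a regexp or an explicit list of file ids.
    return PlaintextCorpusReader(root, fileids)
```

With this in place, `posts = build_reader(posts_path)` would replace the `PlaintextCorpusReader(posts_path, '.*')` call from the question.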
