Creating a new corpus with NLTK
I reckon that often the answer to my title is to go and read the documentation, but I ran through the NLTK book and it doesn't give the answer. I'm kind of new to Python.
I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for the corpora in nltk_data.
I've tried PlaintextCorpusReader but I couldn't get further than:
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> newcorpus.words()
How do I segment the sentences of newcorpus using punkt? I tried using the punkt functions, but they couldn't read the PlaintextCorpusReader class.
Can you also point me to how I can write the segmented data into text files?
After some years of figuring out how it works, here's an updated tutorial on how to create an NLTK corpus from a directory of text files.

The main idea is to make use of the nltk.corpus.reader package. In the case that you have a directory of text files in English, it's best to use the PlaintextCorpusReader.
If you have a directory containing your text files, simply use these lines of code and you can get a corpus:
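Something along these lines should work; newcorpus/ and the .txt pattern are only illustrative names for your own directory and files:

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Point the reader at the directory that holds the .txt files.
corpus_root = 'newcorpus'                      # illustrative directory name
newcorpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')

print(newcorpus.fileids())                     # the .txt files found under newcorpus/
print(newcorpus.words()[:20])                  # first 20 word tokens of the corpus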
NOTE: The PlaintextCorpusReader will use the default nltk.tokenize.sent_tokenize() and nltk.tokenize.word_tokenize() to split your texts into sentences and words. These functions are built for English, so they may NOT work for all languages.

Here's the full code with the creation of test text files, how to create a corpus with NLTK, and how to access the corpus at different levels:
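A runnable sketch of those steps, with made-up file names and contents (file1.txt, file2.txt) inside an illustrative newcorpus/ directory; sentence access relies on the punkt models, which can be fetched with nltk.download('punkt') if they are not installed yet:

import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# 1. Create a small test directory with two text files (names and contents are made up).
corpus_root = 'newcorpus'
os.makedirs(corpus_root, exist_ok=True)
with open(os.path.join(corpus_root, 'file1.txt'), 'w') as f:
    f.write("This is a foo bar sentence. And this is the first text file of the corpus.")
with open(os.path.join(corpus_root, 'file2.txt'), 'w') as f:
    f.write("Was this the first sentence? No, that one was in the other file.")

# 2. Build the corpus from every .txt file in the directory.
newcorpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')

# 3. Access the corpus at different levels.
print(newcorpus.fileids())              # list of files in the corpus
print(newcorpus.raw('file1.txt'))       # raw text of one file
print(newcorpus.words())                # word tokens of the whole corpus
print(newcorpus.sents('file1.txt'))     # sentences of one file, as lists of words
print(newcorpus.paras())                # paragraphs, as lists of sentences

To write the segmented data back out, you can loop over newcorpus.sents(), join each token list with spaces, and write one sentence per line to an ordinary output file.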
Finally, to read a directory of texts and create an NLTK corpus in another language, you must first ensure that you have Python-callable word tokenization and sentence tokenization modules that take string/basestring input and produce such output:
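Roughly, the word tokenizer should turn a string into a list of tokens, and the sentence tokenizer should turn a string into a list of sentence strings. A sketch of wiring such objects into the reader; the whitespace tokenizer below is only a stand-in for a real tokenizer for your language:

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer
import nltk.data

# Stand-in tokenizers; swap in language-appropriate ones for a non-English corpus.
my_word_tokenizer = RegexpTokenizer(r'\S+')       # .tokenize('a b c') -> ['a', 'b', 'c']
my_sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # .tokenize(text) -> list of sentences

othercorpus = PlaintextCorpusReader(
    'newcorpus', r'.*\.txt',                      # same illustrative directory as above
    word_tokenizer=my_word_tokenizer,
    sent_tokenizer=my_sent_tokenizer,
)
print(othercorpus.sents())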
I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English. Through PlaintextCorpusReader's constructor you can pass the reader a word and a sentence tokenizer, but for the latter the default is already nltk.data.LazyLoader('tokenizers/punkt/english.pickle').

For a single string, a tokenizer would be used as follows (explained here; see section 5 for the punkt tokenizer).
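Roughly like this (a small sketch; the sample text is made up):

import nltk.data

# Load the pre-trained English punkt sentence tokenizer and split a string into sentences.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

text = "This is the first sentence. And here is a second one, just to show the split."
for sentence in sent_detector.tokenize(text):
    print(sentence)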