How do I extract words from the sample corpora that come with NLTK?
NLTK comes with some sample corpora, listed at:
http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
I want only the plain text, without the markup. I do not know how to extract such content. What I want to extract is:
1) nps_chat: after unzipping, the filenames look like 10-19-20s_706posts.xml.
These files are in XML format, like:
<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
<t pos="VBD" word="left"/>
<t pos="IN" word="with"/>
<t pos="DT" word="this"/>
<t pos="JJ" word="gay"/>
<t pos="NN" word="name"/>
</terminals>
</Post>
...
...
I only want the actual post:
now im left with this gay name
How can I do this in NLTK (or with anything else) and save the bare posts to local disk after stripping the markup?
2) The Switchboard transcript. This file (named discourse after unzipping) has the following format. What I want is to strip the leading markers:
o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
b A.3 utt1: Uh-huh. /
sd A.3 utt2: I never have either. /
% B.4 utt1: You haven't, {F huh. } /
...
...
I want only:
Okay, /
have you ever served as a juror? /
Never. /
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
Uh-huh. /
I never have either. /
You haven't, {F huh. } /
...
...
Thank you very much in advance.
Comments (2)
First, you need to make a corpus reader for the corpus. nltk.corpus offers several corpus readers that you can use. Once you've made a corpus reader out of your corpus, you can get the words out of the corpus through that reader.
This should get you a list of all the words in all the paragraphs of your corpus.
Hope this helps
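For the nps_chat part of the question, here is a minimal sketch of pulling out the bare post text. It is not the code from the answer above. It assumes the files look like the sample in the question, where each post's text sits directly inside its <Post> element, before <terminals>. The names extract_posts, save_posts and posts_via_nltk are made up for illustration, and the second post in SAMPLE_XML is invented test data. posts_via_nltk relies on NLTK's nps_chat reader and its xml_posts() method, and only works if NLTK and the nps_chat data are installed.

```python
import xml.etree.ElementTree as ET

# Sample in the shape shown in the question. The second post is made up
# so that the example has more than one entry.
SAMPLE_XML = """\
<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
<t pos="VBD" word="left"/>
</terminals>
</Post>
<Post class="Greet" user="10-19-20sUser2">hi everyone<terminals>
<t pos="UH" word="hi"/>
<t pos="NN" word="everyone"/>
</terminals>
</Post>
</Posts>
"""


def extract_posts(xml_text):
    """Return the bare text of every <Post> element in an XML string."""
    root = ET.fromstring(xml_text)
    # post.text is the text that comes before the first child (<terminals>),
    # which is where the sample keeps the original post.
    return [(post.text or "").strip() for post in root.iter("Post")]


def save_posts(posts, path):
    """Write one post per line to a local file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(posts) + "\n")


def posts_via_nltk():
    """Same idea through NLTK's own reader.

    Assumes NLTK is installed and the nps_chat data has been downloaded.
    """
    from nltk.corpus import nps_chat
    return [(post.text or "").strip() for post in nps_chat.xml_posts()]


posts = extract_posts(SAMPLE_XML)
print(posts[0])
```

To process a real file, read it with open(...).read() and pass the string to extract_posts, then call save_posts(posts, "posts.txt") with whatever output path you like.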
You can use the .words() method of the NLTK corpus reader:
content = nps_chat.words()
This will give you all the words in a list:
['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]
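Note that words() returns single tokens, so it will not keep the Switchboard utterances as whole lines. For the second part of the question, a regular expression may be enough. This is only a sketch, assuming every line follows the pattern in the question: a tag, then a speaker and turn such as A.1 or B.2, then uttN: and the text. The names MARKER and strip_markers are made up for illustration, and the pattern assumes the speakers are always A or B.

```python
import re

# Leading marker: a dialogue-act tag (any non-space run such as "o", "sd^e"
# or "%"), a speaker.turn pair such as "A.1", then "uttN:".
# Assumption: speakers are only ever labelled A or B, as in the sample.
MARKER = re.compile(r"^\s*\S+\s+[AB]\.\d+\s+utt\d+:\s*")


def strip_markers(lines):
    """Remove the leading marker from each line; other lines pass through."""
    return [MARKER.sub("", line.rstrip("\n")) for line in lines]


sample = [
    "o A.1 utt1: Okay, /",
    "qy A.1 utt2: have you ever served as a juror? /",
    "sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /",
    "% B.4 utt1: You haven't, {F huh. } /",
]

for utterance in strip_markers(sample):
    print(utterance)
```

For the real file, open it and pass the file object to strip_markers, since iterating over a file yields its lines.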