如何从NLTK自带的样本语料中提取单词?

发布于 2024-10-13 07:45:30 字数 1664 浏览 5 评论 0原文

NLTK 附带了一些语料库样本,位于: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

我只想有没有编码的文本。我不知道如何提取此类内容。我要提取的是

1)nps_chat:解压后文件名类似于10-19-20s_706posts.xml。 此类文件是 XML 格式,如下所示:

<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>

                <t pos="RB" word="now"/>
                <t pos="PRP" word="im"/>
                <t pos="VBD" word="left"/>
                <t pos="IN" word="with"/>
                <t pos="DT" word="this"/>
                <t pos="JJ" word="gay"/>
                <t pos="NN" word="name"/>
            </terminals>

        </Post>
            ...
            ...

我只想要实际的帖子:

now im left with this gay name

在 NLTK 中如何做或(无论如何)在本地磁盘中剥离编码后保存裸帖子?

2) 总机成绩单。此类文件(解压后文件名为discourse)包含以下格式。我想要的是删除前面的标记:

o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
b A.3 utt1: Uh-huh. /
sd A.3 utt2: I never have either. /
% B.4 utt1: You haven't, {F huh. } /
...
... 

我只想:

Okay, /
have you ever served as a juror? /
Never. /
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
Uh-huh. /
I never have either. /
You haven't, {F huh. } /
...
... 

提前非常感谢。

NLTK comes with some samples of corpus at:
http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

I want to have only text without encodings. I do not know how to extract such content. What I want to extract are

1) nps_chat: filename is like 10-19-20s_706posts.xml after unzip.
Such file is XML format like:

<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>

                <t pos="RB" word="now"/>
                <t pos="PRP" word="im"/>
                <t pos="VBD" word="left"/>
                <t pos="IN" word="with"/>
                <t pos="DT" word="this"/>
                <t pos="JJ" word="gay"/>
                <t pos="NN" word="name"/>
            </terminals>

        </Post>
            ...
            ...

I only want that actual post:

now im left with this gay name

How can do in NLTK or (whatever) to save the bare posts after stripping encodings in local disk?

2) switchboard transcript. This type of file (filename is discourse after unzip) contains the following formats. What I want is to strip preceding markers:

o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
b A.3 utt1: Uh-huh. /
sd A.3 utt2: I never have either. /
% B.4 utt1: You haven't, {F huh. } /
...
... 

I want to have only:

Okay, /
have you ever served as a juror? /
Never. /
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
Uh-huh. /
I never have either. /
You haven't, {F huh. } /
...
... 

Thank you very much in advance.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

绅刃 2024-10-20 07:45:30

首先,您需要为语料库制作一个语料库阅读器。您可以在 nltk.corpus 中使用一些语料库阅读器,例如:

AlpinoCorpusReader
BNCCorpusReader
BracketParseCorpusReader
CMUDictCorpusReader
CategorizedCorpusReader
CategorizedPlaintextCorpusReader
CategorizedTaggedCorpusReader
ChunkedCorpusReader
ConllChunkCorpusReader
ConllCorpusReader
CorpusReader
DependencyCorpusReader
EuroparlCorpusReader
IEERCorpusReader
IPIPANCorpusReader
IndianCorpusReader
MacMorphoCorpusReader
NPSChatCorpusReader
NombankCorpusReader
PPAttachmentCorpusReader
Pl196xCorpusReader
PlaintextCorpusReader
PortugueseCategorizedPlaintextCorpusReader
PropbankCorpusReader
RTECorpusReader
SensevalCorpusReader
SinicaTreebankCorpusReader
StringCategoryCorpusReader
SwadeshCorpusReader
SwitchboardCorpusReader
SyntaxCorpusReader
TaggedCorpusReader
TimitCorpusReader
ToolboxCorpusReader
VerbnetCorpusReader
WordListCorpusReader
WordNetCorpusReader
WordNetICCorpusReader
XMLCorpusReader
YCOECorpusReader

一旦您从语料库中创建了语料库阅读器,如下所示:

c = nltk.corpus.whateverCorpusReaderYouChoose(directoryWithCorpus, regexForFileTypes)

您可以使用以下方法从语料库中获取单词 :以下代码:

paragraphs = [para for para in c.paras()]
for para in paragraphs:
    words = [word for sentence in para for word in sentence]

这应该会为您提供语料库所有段落中所有单词的列表。

希望这有帮助

First, you need to make a corpus reader for the corpus. There are some corpus readers that you can use in nltk.corpus, such as:

AlpinoCorpusReader
BNCCorpusReader
BracketParseCorpusReader
CMUDictCorpusReader
CategorizedCorpusReader
CategorizedPlaintextCorpusReader
CategorizedTaggedCorpusReader
ChunkedCorpusReader
ConllChunkCorpusReader
ConllCorpusReader
CorpusReader
DependencyCorpusReader
EuroparlCorpusReader
IEERCorpusReader
IPIPANCorpusReader
IndianCorpusReader
MacMorphoCorpusReader
NPSChatCorpusReader
NombankCorpusReader
PPAttachmentCorpusReader
Pl196xCorpusReader
PlaintextCorpusReader
PortugueseCategorizedPlaintextCorpusReader
PropbankCorpusReader
RTECorpusReader
SensevalCorpusReader
SinicaTreebankCorpusReader
StringCategoryCorpusReader
SwadeshCorpusReader
SwitchboardCorpusReader
SyntaxCorpusReader
TaggedCorpusReader
TimitCorpusReader
ToolboxCorpusReader
VerbnetCorpusReader
WordListCorpusReader
WordNetCorpusReader
WordNetICCorpusReader
XMLCorpusReader
YCOECorpusReader

Once you've made a corpus reader out of your corpus like so:

c = nltk.corpus.whateverCorpusReaderYouChoose(directoryWithCorpus, regexForFileTypes)

you can get the words out of the corpus by using the following code:

paragraphs = [para for para in c.paras()]
for para in paragraphs:
    words = [word for sentence in para for word in sentence]

This should get you a list of all the words in all the paragraphs of your corpus.

Hope this helps

不如归去 2024-10-20 07:45:30

您可以使用 nltk 语料库中的 .words() 属性

content = nps_chat.words()

这将为您提供列表中的所有单词

['now', '我','左','与','这个','同性恋','名字',...]

You can use .words() property from nltk corpus

content = nps_chat.words()

This will give you all the words in a list

['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文