How do I create my own corpus in the Python Natural Language Toolkit?

Posted on 2024-08-20 05:01:08

I have recently expanded the names corpus in nltk and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus so that I can access them using the existing nltk.corpus methods. Does anyone have any suggestions?

Many thanks,
James.

Comments (3)

浮光之海 2024-08-27 05:01:08

As the readme says, the names corpus is not in the public domain, so you should send an email with any changes you make to the corpus author (the address is in that file). Apart from that detail of law and courtesy, you can simply replace either or both of those files with your own. They are in a very simple format: one name per line, with comments allowed (and ignored) on lines that start with '#'.
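To make the file format described above concrete, here is a minimal sketch of a reader for it, using only the standard library. The function name `read_name_list` and the sample data are hypothetical and exist only for illustration; this is not NLTK's own loading code, just one reading of "one name per line, '#' lines ignored".

```python
import tempfile
from pathlib import Path


def read_name_list(path):
    """Return the names in a file that holds one name per line.

    Blank lines and lines starting with '#' are skipped, following the
    format described in the answer above (an assumption, not NLTK code).
    """
    names = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.append(line)
    return names


# Write a small hypothetical sample file and parse it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "male.txt"
    sample.write_text("# hypothetical sample data\nAdam\n\nBrian\n", encoding="utf-8")
    parsed_names = read_name_list(sample)

print(parsed_names)
```

A check like this can be useful for validating an expanded male.txt or female.txt before dropping it into the corpus directory.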

To install a totally new corpus rather than just tweaking an existing one, you could start with the docs given here.

猫腻 2024-08-27 05:01:08

I came to understand how corpus reading works by looking at the source code in nltk.corpus and then looking at the corpora themselves (located in /home/[user]/nltk_data/corpora/names; this will probably be in My Documents for XP users and somewhere in User for Win7 users).

Studying the structure of a corpus together with its related reader gives a good understanding of how to use the different corpora available in NLTK.

In my case I looked at the names variable in the nltk.corpus source code and was interested in the WordListCorpusReader class, since the names corpus is simply a list of words.

回忆躺在深渊里 2024-08-27 05:01:08

Alex is right: start with the docs and figure out which corpus reader will work for your corpus. Then simply instantiate it, given the path to your corpus file(s). As you'll see in the docs, the builtin corpora are simply instances of particular corpus reader classes. Looking through the code in the nltk.corpus package should be helpful as well.
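As a rough sketch of what "instantiate it, given the path" could look like for the two files in the question, the example below points a `WordListCorpusReader` at a directory containing male.txt and female.txt. The temporary directory, the sample names, and the variable names (`my_names`, `male_names`, `all_names`) are assumptions made for illustration; in practice the root would be wherever your own files live.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import WordListCorpusReader

# Build a throwaway corpus directory with hypothetical sample data.
with tempfile.TemporaryDirectory() as corpus_root:
    Path(corpus_root, "male.txt").write_text("Adam\nBrian\n", encoding="utf-8")
    Path(corpus_root, "female.txt").write_text("Alice\nBeth\n", encoding="utf-8")

    # The reader takes the corpus root and the file ids it should expose.
    my_names = WordListCorpusReader(corpus_root, ["male.txt", "female.txt"])

    # Read everything while the temporary directory still exists.
    file_ids = my_names.fileids()
    male_names = my_names.words("male.txt")
    all_names = my_names.words()

print(file_ids)
print(male_names)
print(all_names)
```

The resulting object supports the same `fileids()` and `words()` calls as the builtin `nltk.corpus.names`, which is the kind of access the question asks for.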
