Advantages of creating your own corpus in NLTK
I have a large amount of text in MySQL tables. I want to run some statistical analysis on it, and later some NLP, using the NLTK toolkit.
I have two choices:
- Extract all the text at once from my DB table (writing it to a file if needed) and use the NLTK functions directly.
- Extract the text and turn it into a "corpus" that can be used with NLTK.
The latter seems quite complicated, and I haven't found any articles that actually describe how to do it. I only found this: Creating a MongoDB backed corpus reader, which uses MongoDB as its database; the code is quite complicated and also requires knowing MongoDB. The former, on the other hand, seems really straightforward but incurs the overhead of extracting the texts from the DB.
So my question is: what are the advantages of a corpus in NLTK? In other words, if I take on the challenge and dig into overriding NLTK methods so that they can read from a MySQL database, would it be worth the hassle? Does turning my text into a corpus give me anything that I cannot do (or can do only with great difficulty) with ordinary NLTK functions?
Also, if you know anything about connecting MySQL to NLTK, please let me know.
Thanks
Comments (1)
Well, after a lot of reading I found the answer.
There are several very useful functions, such as collocations, search (concordance), common_contexts and similar, that can be used on texts loaded into NLTK. Implementing them yourself takes quite some time. If I select my text from the database, put it in a file, and wrap it with nltk.Text, I can use all of the functions mentioned above without writing many lines of code or overriding methods just to connect to MySQL. More info is in the documentation for nltk.Text.
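As a rough illustration of that approach, here is a minimal sketch. It is not a definitive implementation: it uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server, and the `documents` table, its `body` column, and the helper names `load_documents` and `build_text` are all hypothetical. With a MySQL driver that follows the Python DB-API, the cursor-based `load_documents` part should look much the same.

```python
import sqlite3

import nltk
from nltk.tokenize import wordpunct_tokenize


def load_documents(conn):
    """Return every document body from the (hypothetical) `documents` table."""
    cur = conn.cursor()
    cur.execute("SELECT body FROM documents ORDER BY id")
    return [row[0] for row in cur.fetchall()]


def build_text(documents):
    """Tokenize the documents and wrap all tokens in a single nltk.Text."""
    tokens = []
    for doc in documents:
        # wordpunct_tokenize is regex-based, so no NLTK data download is needed.
        tokens.extend(wordpunct_tokenize(doc.lower()))
    return nltk.Text(tokens)


# Assumption: SQLite stands in for MySQL here, with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [
        ("The quick brown fox jumps over the lazy dog.",),
        ("The lazy dog sleeps while the quick cat watches.",),
    ],
)

text = build_text(load_documents(conn))

print(text.count("lazy"))                # occurrences of a token: 2
text.concordance("lazy", width=40)       # prints each occurrence in context
text.common_contexts(["quick", "lazy"])  # prints contexts shared by both words
text.similar("quick")                    # prints words used in similar contexts
```

On a real data set, the frequency distribution from `text.vocab()` is a reasonable starting point for the statistical analysis, and none of this requires writing a custom corpus reader.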