Building a word dictionary from a large text
I have a text file containing posts in English/Italian. I would like to read the posts into a data matrix so that each row represents a post and each column a word. Each cell in the matrix holds the number of times that word appears in that post. The dictionary should consist of either all the words in the whole file or a non-exhaustive English/Italian dictionary.
I know this is a common, essential preprocessing step for NLP, and I know it's pretty trivial to code. Still, I'd like to use an NLP-specific tool so that I get stop words trimmed, etc.
Does anyone know of a tool or project that can perform this task?
Someone mentioned Apache Lucene. Do you know whether a Lucene index can be serialized to a data structure similar to what I need?
Comments (4)
Maybe you want to look at GATE. It is an infrastructure for text-mining and processing. This is what GATE does (I got this from the site):
What you want is so simple that, in most languages, I would suggest you roll your own solution using an array of hash tables that map from strings to integers. For example, in C#:
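The C# snippet that went with this answer is not preserved here. As a rough sketch of the same idea (one hash table per post, mapping each word to its count), here is a Python version. It is not the answerer's code: the tokenizer regex, the tiny `STOP_WORDS` set, and the function names are all assumptions made for illustration, and a real stop-word list for English and Italian would be much larger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (English + Italian).
STOP_WORDS = {"the", "a", "is", "and", "il", "la", "e", "di"}

# Runs of letters only (Unicode-aware, so accented Italian letters are kept).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens, minus stop words."""
    return [w for w in TOKEN_RE.findall(text.lower()) if w not in STOP_WORDS]


def build_term_document_matrix(posts):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in posts[i]."""
    # One hash table (Counter) per post: word -> number of occurrences.
    counts = [Counter(tokenize(post)) for post in posts]
    # The dictionary is every word seen anywhere in the input, in sorted order.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing keys, so absent words become 0 cells.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


posts = [
    "The cat is on the table",
    "Il gatto dorme sul tavolo",
    "the cat and il gatto",
]
vocabulary, matrix = build_term_document_matrix(posts)
print(vocabulary)
for row in matrix:
    print(row)
```

For a large file the dense list-of-lists will waste memory, since most cells are zero; keeping only the per-post counters (or moving to a sparse matrix type) is the usual way around that.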
You can check out:
Thanks to @Mikos' comment, I googled the term "term-document matrix" and found TMG (Text to Matrix Generator).
I found it suitable for my needs.