Building a dictionary of words from a large text

Posted on 2024-08-28 14:16:53

I have a text file containing posts in English and Italian. I would like to read the posts into a data matrix in which each row represents a post and each column a word. Each cell holds the number of times that word appears in that post. The dictionary should consist of either all the words in the whole file or a non-exhaustive English/Italian dictionary.

I know this is a common, essential preprocessing step in NLP, and I know it is fairly trivial to code. Still, I would like to use an NLP-specific tool so that I get stop-word trimming and similar features.

Does anyone know of a tool or project that can perform this task?

Someone mentioned Apache Lucene. Do you know whether a Lucene index can be serialized to a data structure similar to what I need?
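
For readers who want to see the target data structure concretely, here is a minimal Python sketch of the post-by-word count matrix described above. It is not taken from any of the tools discussed in this thread. The names `STOP_WORDS`, `tokenize`, and `term_document_matrix` are hypothetical, the stop-word list is a deliberately tiny placeholder, and the tokenizer simply keeps runs of letters, so an elided form such as "l'albero" is split in two.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration);
# a real English/Italian list would be far longer.
STOP_WORDS = {"the", "a", "is", "il", "la", "è", "e"}


def tokenize(post):
    """Lowercase a post, split it into runs of letters, and drop stop words."""
    words = re.findall(r"[^\W\d_]+", post.lower())
    return [w for w in words if w not in STOP_WORDS]


def term_document_matrix(posts, vocabulary=None):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in posts[i].

    If no vocabulary is given, it is built from every word seen in the posts.
    Passing a fixed word list instead restricts the columns to that list.
    """
    counts = [Counter(tokenize(post)) for post in posts]
    if vocabulary is None:
        vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for missing words, so absent terms become zero cells.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix
```

Calling `term_document_matrix(posts)` covers the "all the words in the whole file" case, while `term_document_matrix(posts, vocabulary=my_words)` covers the fixed-dictionary case.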

Comments (4)

—━☆沉默づ 2024-09-04 14:16:53

Maybe you want to look at GATE. It is an infrastructure for text mining and processing. This is how the GATE site describes it:

  • open source software capable of solving almost any text processing problem
  • a mature and extensive community of developers, users, educators, students and scientists
  • a defined and repeatable process for creating robust and maintainable text processing workflows
  • in active use for all sorts of language processing tasks and applications, including: voice of the customer; cancer research; drug research; decision support; recruitment; web mining; information extraction; semantic annotation
  • the result of a multi-million-euro R&D programme running since 1995, funded by commercial users, the EC, BBSRC, EPSRC, AHRC, JISC, etc.
  • used by corporations, SMEs, research labs and universities worldwide
  • the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining
一刻暧昧 2024-09-04 14:16:53

What you want is simple enough that, in most languages, I would suggest rolling your own solution using an array of hash tables that map strings to integers. For example, in C#:

using System.Collections.Generic;

// One hash table (word -> count) per post; together they form a sparse matrix.
var rows = new List<Dictionary<string, int>>();

foreach (var post in posts)
{
  var row = new Dictionary<string, int>();

  // GetWordsFromPost is a placeholder for your own tokenizer.
  foreach (var word in GetWordsFromPost(post))
  {
    IncrementContentOfRow(row, word);
  }

  rows.Add(row);
}

// ...

private void IncrementContentOfRow(IDictionary<string, int> row, string word)
{
  // Start from 0 the first time a word is seen in this post.
  int oldValue;
  if (!row.TryGetValue(word, out oldValue))
  {
    oldValue = 0;
  }

  row[word] = oldValue + 1;
}
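
The per-post hash tables above form a sparse representation. To get the dense row-per-post, column-per-word matrix the question asks for, the rows still have to be aligned to a shared column order. The following Python sketch shows one way to do that step, assuming each row is a plain dict from word to count. The function name `rows_to_matrix` and its `vocabulary` parameter are hypothetical, introduced only for illustration.

```python
def rows_to_matrix(rows, vocabulary=None):
    """Convert sparse rows (word -> count dicts) into a dense list-of-lists matrix.

    If no vocabulary is given, the columns are all words seen in any row,
    in sorted order. With a fixed vocabulary, words outside it are skipped.
    """
    if vocabulary is None:
        vocabulary = sorted({word for row in rows for word in row})

    # Map each word to its column index once, so every lookup is O(1).
    column = {word: j for j, word in enumerate(vocabulary)}

    matrix = [[0] * len(vocabulary) for _ in rows]
    for i, row in enumerate(rows):
        for word, count in row.items():
            j = column.get(word)
            if j is not None:
                matrix[i][j] = count
    return vocabulary, matrix
```

For a large corpus the dense matrix is mostly zeros, so it may be worth keeping the sparse rows and densifying only the slice you need.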
妳是的陽光 2024-09-04 14:16:53

You can check out:

  • bow - a veteran C library for text classification; I know it stores the matrix, but it may require some hacking to get at it.
  • Weka - a Java machine learning framework that can handle text and build the matrix
  • Sujit Pal's blog post on building the term-document matrix from scratch
  • If you insist on using Lucene, you should create an index using term vectors, and use something like a loop over getTermFreqVector() to get the matrix.
知你几分 2024-09-04 14:16:53

Thanks to @Mikos' comment, I googled "term-document matrix" and found TMG (Text to Matrix Generator).

I found it suitable for my needs.
