What is the best data structure for storing the words found in a document along with a counter of their occurrences?
Let's say I have a corpus of documents that I want to read one by one and store in a data structure. The structure will probably be a list of objects, each an instance of a class that defines a single document. Inside that class I'll need a data structure to store the contents of each document. What should that be? Also, if I want to count occurrences of words and retrieve the most frequent words in each document, will I need a data structure that lets me do this in less than the O(n) time it would take to examine all the contents sequentially?
Comments (1)
Use an associative array, also called a map or a dictionary, since different programming languages use different terms for the same data structure.
Each entry's key would be a word, and the entry's value would be that word's counter. For example
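To make the word-to-counter mapping concrete, here is a minimal sketch in Python, not the answerer's original example. It assumes a hypothetical `Document` class, a hypothetical `most_frequent` method, and a simple tokenization rule (lowercase, split on anything that is not a letter, digit, or apostrophe); none of these come from the question. It uses `collections.Counter`, a dictionary subclass from the standard library.

```python
import re
from collections import Counter


class Document:
    """Hypothetical container for one document and its word counts."""

    def __init__(self, text):
        self.text = text
        # Assumed tokenization: lowercase, then keep runs of
        # letters, digits, and apostrophes as words.
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Counter maps each word (key) to its number of occurrences (value).
        self.counts = Counter(words)

    def most_frequent(self, k=1):
        """Return the k most frequent words as (word, count) pairs."""
        return self.counts.most_common(k)


# The corpus is a plain list of Document objects, as the question suggests.
corpus = [
    Document("the cat sat on the mat"),
    Document("a dog and a cat and a bird"),
]

for doc in corpus:
    print(doc.most_frequent(2))
```

Building the counts still takes one pass over every word in the document, so that step is O(n). After that, looking up a single word's count is O(1) on average, and `most_frequent` only scans the distinct words in the dictionary rather than rereading the text.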