Counting repeated words in a file
Goal: find the count of every word in a file. The file contains 1000+ words.
My approach: use a HashMap<String, Integer> to store and count the number of times each word appears in the file.
Question: Would a HashMap be the best way, or would it be better to use a binary tree to ensure faster lookup, since there is a large number of words in the file? Or is there a better way to do this?
HashMap would result in a lot of memory overhead, which is not desired.
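For reference, the HashMap approach described in the question might look roughly like the sketch below. The class name WordCounter, the method name countWords, and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions made for illustration, not part of the original question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the names and the tokenization rule are assumptions.
class WordCounter {

    // Counts how often each word occurs in the given lines of text.
    // Words are lowercased and split on any run of characters that are
    // neither letters nor digits, so "The" and "the" count as one word.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                // split() can yield an empty first token when the line
                // starts with a separator, so skip empty strings.
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

For a file of this size, the lines could be supplied with java.nio.file.Files.readAllLines on the file's path, which reads the whole file into memory at once.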
Comments (5)
So you're looking for distinct words?
The most efficient structure I can think of is a Trie.
Here's one open-source implementation: Google Code patricia-trie
That said, I tend to agree with Mitch Wheat: it sounds like a HashMap should work fine. It's always best to avoid premature optimization, so you should use a HashMap until you've shown that it's a bottleneck.
1000 to 10000 words is very small.
A HashMap will be fine.
I would recommend doing such a task in Perl/PHP. It's very hard to kill a fly with a machine gun.
A HashMap is perfect. You need to store
A HashMap really won't store much more than that!
Assuming that the strings are not insanely long, a "Trie" approach as Michael suggests would be good. Each node in the Trie can store a character and the "count" of the strings that end at that character. This should drastically reduce the storage requirements (again, assuming the strings are uniformly distributed and overlapping).
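A minimal sketch of such a counting trie is shown here. It is a plain, uncompressed trie written for illustration, not the patricia-trie library mentioned in another answer, and the class and method names are assumptions. Each node keeps its children in a map keyed by character plus a count of the words that end at that node.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; a plain trie where each node carries a word count.
class CountingTrie {

    private static final class Node {
        // Child nodes, keyed by the next character of the word.
        final Map<Character, Node> children = new HashMap<>();
        // Number of added words that end exactly at this node.
        int count;
    }

    private final Node root = new Node();

    // Records one occurrence of the word, creating nodes as needed.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Returns how many times the word was added, or 0 if it never was.
    int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

Whether this actually saves memory compared with a HashMap depends on how much the words share prefixes. In this sketch every node allocates its own child map, so for a few thousand short words the per-node overhead may well outweigh the savings.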
Assuming that the counts do not need to be persisted across invocations, when using a HashMap, let the Map be from Integer => Integer, where the "key" is the hash code of the string and the value is the count. This should be an efficient solution, with fast lookup and a reduced memory footprint.
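A rough sketch of the Integer => Integer idea follows, with hypothetical class and method names. One caveat worth stating plainly: two different strings can have the same hash code (in Java, "Aa" and "BB" both hash to 2112), and in that case their counts are merged. The map also no longer holds the words themselves, so it can answer "how often did this word occur?" but cannot list the words it has seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; keys are String.hashCode() values, not the words.
class HashKeyedCounter {

    private final Map<Integer, Integer> counts = new HashMap<>();

    // Records one occurrence of the word under its hash code.
    void add(String word) {
        counts.merge(word.hashCode(), 1, Integer::sum);
    }

    // Returns the count stored under the word's hash code.
    // Words whose hash codes collide share a single count.
    int count(String word) {
        return counts.getOrDefault(word.hashCode(), 0);
    }
}
```

If exact counts per word matter, a HashMap keyed by the string itself avoids the collision problem at the cost of keeping the strings in memory.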