Counting repeated words in a file

Posted 2024-09-27 18:46:43

Goal: count how many times each word occurs in a file. The file contains 1000+ words.

My approach: use a HashMap<String, Integer> to store each word and the number of times it appears in the file.

Question: Would a HashMap be the best choice, or would a binary tree give faster lookups, given the large number of words in the file?

Or is there a better way to do this?

A HashMap would add a lot of memory overhead, which is undesirable.


Comments (5)

悲凉≈ 2024-10-04 18:46:43

So you're looking for distinct words?

The most efficient structure I can think of is a trie.

Here's one open-source implementation: Google Code patricia-trie

That said, I tend to agree with Mitch Wheat: it sounds like a HashMap should work fine. It's always best to avoid premature optimization, so use a HashMap until you've shown that it's a bottleneck.
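For readers unfamiliar with the structure, here is a minimal sketch of a trie used for word counting. It is not the patricia-trie library mentioned above; the class name WordTrie and its methods are made up for illustration. Each node holds its children and a count of the words that end at that node.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of any library.
class WordTrie {
    // One node per character position; `count` is the number of added
    // words that end exactly at this node.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count;
    }

    private final Node root = new Node();

    // Walks (and creates, where needed) the path for `word`, then bumps its count.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Returns how many times `word` was added, or 0 if it was never seen.
    int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

The per-node HashMap keeps the sketch short, but it carries its own overhead, so for a file of a few thousand words this may well use more memory than a plain HashMap<String, Integer>. Measuring both is the only reliable way to tell.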

虫児飞 2024-10-04 18:46:43

1000 to 10000 words is very small.

A HashMap will be fine.

沉鱼一梦 2024-10-04 18:46:43

I would recommend doing a task like this in Perl or PHP. It's very hard to kill a fly with a machine gun.

合久必婚 2024-10-04 18:46:43

A HashMap is perfect. You need to store

  • a copy of each word encountered
  • the count for each

A HashMap really won't store much more than that!
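A minimal sketch of the HashMap approach, for reference. The class name WordCounter, the lowercasing, and the rule of splitting on runs of non-letter characters are assumptions made for illustration; the question does not say how words are delimited.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class.
class WordCounter {
    // Maps each distinct word to its number of occurrences.
    // Assumption: words are case-insensitive and separated by non-letters.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                // Insert 1 for a new word, otherwise add 1 to the stored count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordCounter <file>");
            return;
        }
        // Reads the whole file at once, which is fine for a few thousand words.
        String text = Files.readString(Path.of(args[0]));
        countWords(text).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

The map holds one String key and one Integer value per distinct word, which matches the two items listed above. Files.readString and Path.of require Java 11 or later.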

墨落成白 2024-10-04 18:46:43
  1. Assuming the strings are not insanely long, the trie approach Michael suggests would be good. Each node in the trie can store its character and the count of the strings that end at that character. This should drastically reduce the storage requirements (again assuming the strings are uniformly distributed and overlap).

  2. Assuming the counts do not need to persist across invocations, when using a HashMap, let the map be from Integer => Integer, where the key is the hash code of the string and the value is the count. This should be an efficient solution, with fast lookup and a reduced memory footprint.
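The second suggestion can be sketched as follows; the class name HashKeyedCounter is made up for illustration. One point worth checking before relying on it: distinct strings can share a hash code (in Java, "Aa" and "BB" both hash to 2112), so their counts would be merged, and the words themselves cannot be recovered from the map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class.
class HashKeyedCounter {
    // Counts by String.hashCode() only; the words themselves are not stored.
    static Map<Integer, Integer> countByHash(String[] words) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word.hashCode(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = countByHash(new String[] {"Aa", "BB", "cat"});
        // "Aa" and "BB" collide, so three words produce only two entries.
        System.out.println(counts.size());
    }
}
```

Whether this saves memory in practice is also uncertain, since the Integer keys are still boxed objects, so it is a trade-off to measure rather than assume.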
