Which data structure would you use: TreeMap or HashMap? (Java)
Description | A Java program to read a text file and print each of the unique words in alphabetical order together with the number of times the word occurs in the text.
The program should declare a variable of type Map<String, Integer> to store the words and corresponding frequency of occurrence. Which concrete type, though? TreeMap<String, Number> or HashMap<String, Number>?
The input should be converted to lower case.
A word does not contain any of these characters: \t\t\n]f.,!?:;\"()'
Example output |
Word Frequency
a 1
and 5
appearances 1
as 1
.
.
.
Remark | I know, I've seen elegant solutions to this in Perl with roughly two lines of code. However, I want to see it in Java.
Edit: Oh yeah, it would be helpful to show an implementation using one of these structures (in Java).
Comments (14)
Depending on what the speed requirements are, you could also use a Trie. But there's no point in implementing one of those if a TreeMap is quick enough.
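For readers unfamiliar with the idea, here is a minimal sketch of what a trie-based counter could look like. The class name WordTrie and its methods are hypothetical, invented for illustration; this is not a library class and has not been tuned for speed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal trie for counting words; not a library class.
class WordTrie {
    private static final class Node {
        // Children are kept sorted by character so a depth-first walk is alphabetical.
        final Map<Character, Node> children = new TreeMap<>();
        int count; // number of times the word ending at this node was added
    }

    private final Node root = new Node();

    // Walks (and creates) one node per character, then bumps the final node's count.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.count++;
    }

    // Returns "word count" lines in alphabetical order.
    List<String> report() {
        List<String> lines = new ArrayList<>();
        collect(root, new StringBuilder(), lines);
        return lines;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix + " " + node.count);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        for (String w : "and a and as".split(" ")) {
            trie.add(w);
        }
        trie.report().forEach(System.out::println);
    }
}
```

Compared with a TreeMap this is noticeably more code, which is the point made above: it is only worth writing if the simpler structure turns out to be too slow.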
Consider the frequency of additions to and deletions from the data structure. A TreeMap would not be ideal if it is high: apart from the O(log n) search for an existing entry, it also undergoes frequent rebalancing.
On the other hand, hash structures are a bit extravagant with memory (they over-allocate). If you can bite that bullet, go for a hash structure and sort when required.
Here is a Java example that reads a text file and sorts the words by key, then by value, i.e. by the number of occurrences of each word in the file.
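The answer's own listing is not available here, so the following is only a sketch of the idea under stated assumptions: the input is taken as an already tokenized, lower-cased list of words instead of a file, and "then by value" is read as "most frequent first". The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortByKeyThenValue {
    // Counts words; the TreeMap keeps its keys in alphabetical order.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Re-orders the entries by descending frequency. List.sort is stable,
    // so words with equal counts stay in alphabetical order.
    static List<Map.Entry<String, Integer>> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("b", "a", "b", "c"));
        System.out.println("By key:   " + counts);
        System.out.println("By value: " + byFrequency(counts));
    }
}
```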
Why not use TreeSet?
Same ordering concept as a TreeMap, except it's a Set - which, by definition, is "A collection that contains no duplicate elements".
From your problem description, it sounds as if you need a Set; I don't see what keys and values you are mapping together.
Basically, it depends on the requirement. Sometimes a hash map is good, sometimes a tree map. But a hash map is the better choice only when there is some constraint on the overhead of sorting.
TreeMap seems a no-brainer to me, simply because of the "in alphabetical order" requirement. HashMap has no ordering when you iterate through it; TreeMap iterates in the natural key order.
EDIT: I think Konrad's comment may have been suggesting "use HashMap, then sort." This is good because although we'll have N iterations initially, we'll have K <= N keys by the end due to duplicates. We might as well save the expensive bit (sorting) until the end, when we've got fewer keys, rather than take the small-but-non-constant hit of keeping it sorted as we go.
Having said that, I'm sticking to my answer for the moment, because it's the simplest way of achieving the goal. We don't really know that the OP is particularly worried about performance, but the question implies that he's concerned about elegance and brevity. Using a TreeMap makes this incredibly brief, which appeals to me. I suspect that if performance is really an issue, there may be a better way of attacking it than either TreeMap or HashMap :)
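To give a feel for how brief the TreeMap route can be, here is a minimal sketch. It assumes the text is already in memory as a String and, as a simplification of the question's delimiter list, treats any run of characters other than a to z as a separator. The class name TreeMapCount is hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TreeMapCount {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        // Simplification: split on anything that is not a lower-case ASCII letter.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) { // split can yield a leading empty string
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Iterating a TreeMap visits keys in natural (alphabetical) order.
        count("As and AND, a and!").forEach((w, n) -> System.out.println(w + " " + n));
    }
}
```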
TreeMap beats HashMap because TreeMap is already sorted for you.
However, you might want to consider using a more appropriate data structure, a bag. See Commons Collections and the TreeBag class:
This has a nice optimised internal structure and API:
EDIT: The question of HashMap vs TreeMap performance was answered by Jon: HashMap and sort may be quicker (try it!), but TreeBag is easier. The same is true for bags: there is a HashBag as well as a TreeBag. Based on the implementation (which uses a mutable integer), a bag should outperform the equivalent plain map of Integer. The only way to know for sure is to test, as with any performance question.
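The "mutable integer" remark can be illustrated without the library. The sketch below is a hand-rolled stand-in, not the Commons Collections TreeBag API: SimpleTreeBag and its methods are hypothetical names, and a one-element int[] plays the role of the mutable counter.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical hand-rolled bag; NOT the Commons Collections TreeBag API.
class SimpleTreeBag {
    // A one-element int[] acts as a mutable counter, so an increment does not
    // box a new Integer or write a new value back into the map.
    private final Map<String, int[]> counts = new TreeMap<>();

    void add(String item) {
        counts.computeIfAbsent(item, k -> new int[1])[0]++;
    }

    int getCount(String item) {
        int[] c = counts.get(item);
        return c == null ? 0 : c[0];
    }

    // Distinct items in sorted order.
    Iterable<String> uniqueItems() {
        return counts.keySet();
    }

    public static void main(String[] args) {
        SimpleTreeBag bag = new SimpleTreeBag();
        for (String w : "and a and as".split(" ")) {
            bag.add(w);
        }
        for (String w : bag.uniqueItems()) {
            System.out.println(w + " " + bag.getCount(w));
        }
    }
}
```

Whether this actually beats a Map<String, Integer> in practice is, as the answer says, something to measure.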
I see quite a few people saying "TreeMap look-up takes O(n log n)"!! How come?
I don't know how it has been implemented, but in my head it takes O(log n).
This is because look-up in a tree can be done in O(log n). You don't sort the entire tree every time you insert an item in it. That's the whole idea of using a tree!
Hence, going back to the original question, the figures for comparison turn out to be:
HashMap approach: O(n + k log k) average case, worst case could be much larger
TreeMap approach: O(k + n log k) worst case
where n = number of words in the text, k = number of distinct words in the text.
A hash map should be much faster. You should not choose a container based on how you want the items to be arranged eventually; just sort the list of (word, frequency) pairs at the end. There will usually be fewer such pairs to sort than there are words in the file, so asymptotic (and real) performance with a hash map will be better.
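A minimal sketch of the "count in a hash map, sort once at the end" approach, assuming the words are already tokenized and lower-cased. The class name HashThenSort is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashThenSort {
    static List<Map.Entry<String, Integer>> countAndSort(Iterable<String> words) {
        // Counting phase: expected O(1) per word, no ordering maintained.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        // Sorting phase: one sort over the k distinct words only.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>(freq.entrySet());
        pairs.sort(Map.Entry.<String, Integer>comparingByKey());
        return pairs;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : countAndSort(Arrays.asList("as", "and", "a", "and"))) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```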
You can't assign a TreeMap<String,Number> to a variable with the type Map<String,Integer>. Double, Long, etc. can be "put" into a TreeMap<String,Number>. When I "get" a value from a Map<String,Integer>, it must be an Integer.
Completely ignoring any i18n issues, memory constraints, and error handling, here goes:
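The answer's original listing is not reproduced here. What follows is a minimal sketch in the same spirit, not the answerer's code. Assumptions: the file path is passed as the single command-line argument, the file is UTF-8, and the question's partly garbled delimiter list is read as whitespace plus the listed punctuation. The class name WordFrequency and the DELIMITERS constant are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordFrequency {
    // Assumed reading of the question's delimiter list: whitespace plus punctuation.
    static final String DELIMITERS = " \t\n\r\f.,!?:;\"()'";

    // Lower-cases one line, splits it on the delimiters, and counts each word.
    static void addLine(Map<String, Integer> freq, String line) {
        StringTokenizer tokens = new StringTokenizer(line.toLowerCase(Locale.ROOT), DELIMITERS);
        while (tokens.hasMoreTokens()) {
            freq.merge(tokens.nextToken(), 1, Integer::sum);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordFrequency <file>");
            return;
        }
        // The declared type is Map<String, Integer>, so the concrete type must be
        // TreeMap<String, Integer>; a TreeMap<String, Number> would not compile here.
        Map<String, Integer> freq = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                addLine(freq, line);
            }
        }
        System.out.println("Word Frequency");
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Note that with this delimiter set an apostrophe splits a word, so "it's" is counted as "it" and "s".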
"When a key already exists it has the same performance as a HashMap." - That is just plain wrong. HashMap has O(1) insertion and TreeMap O(n log n). It'll take at least n log n checks to find out if it's in the table!
For this task, in my opinion, it is better to use HashBag from Apache Commons Collections, HashMultiset from Guava, HashBag from Eclipse Collections (formerly GS Collections), or any of the following classes:
Examples:
1. Using SynchronizedSortedBag from Apache:
2. Using TreeBag from Eclipse (GS Collections):
3. Using LinkedHashMultiset from Guava:
You can find more examples in my GitHub projects.
I would definitely choose a TreeMap:
A TreeSet internally uses a TreeMap, so why not use TreeMap directly?