我正在创建一个使用字典的拼字游戏。为了提高效率,不是将整个字典(通过txt文件)加载到数据结构(集合、列表等)中,而是有任何内置的java类可以帮助我将文件的内容视为字符串。
具体来说,我想做的是通过执行诸如 fileName.contains (word) 之类的简单操作来检查游戏中创建的单词是否是字典中的有效单词,而不是使用内存效率低下的巨大列表并使用 list.contains (word )。
你们知道我能做什么吗?如果字典文件必须位于 txt 文件之外的其他文件中(例如 xml 文件),我也愿意尝试。
注意:我不是在寻找 http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29
这个方法不是 java API 的一部分。
HashSet 没有出现在我的脑海中,我一直坚持所有 contains () 方法都使用 O(n) 时间的想法,感谢 Bozho 清除了这个问题我,看起来我将使用 HashSet。
I am creating a Scrabble game that uses a dictionary. For efficiency, instead of loading the entire dictionary (via txt file) to a Data Structure (Set, List etc.) is there any built in java class that can help me treat the contents of the file as String.
Specifically what I want to do is check whether a word made in the game is a valid word of the dictionary by doing something simple like fileName.contains (word) instead of having a huge list that is memory inefficient and using list.contains (word).
Do you guys have any idea on what I may be able to do. If the dictionary file has to be in something other than a txt file (e.g. xml file), I am open to try that as well.
NOTE: I am not looking for http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29
This method is not a part of the java API.
HashSet didn't come to mind, I was stuck in the idea that all contains () methods used O(n) time, thanks to Bozho for clearing that with me, looks like I will be using a HashSet.
发布评论
评论(4)
我认为最好的选择是将它们全部加载到内存中的
HashSet
中。其中contains(word)
的复杂度为 O(1)。如果您愿意将其保留在内存中,那么将其作为
String
来调用contains(..)
的效率比HashSet
低得多>。我不得不提另一个选择 - 有一个数据结构来表示字典 - 它被称为
Trie
。但你在 JDK 中找不到实现。一个非常粗略的计算表明,对于所有英语单词(100 万),您将需要约 12 MB 的 RAM。这比 JVM 的默认内存设置小几倍。 (100 万 * 平均 6 个字母 * 每个字母 2 字节 = 1200 万字节,即约 12 兆字节)。 (好吧,也许多一点来存储哈希值)
如果您真的坚持不在内存中读取它,并且您想扫描文件中的给定单词,那么您可以使用
java.util.Scanner
及其scanner.findWithHorizon (..)
。但这是低效的——我假设 O(n) 和 I/O 开销。I think your best option is to load them all in memory, in a
HashSet
. Therecontains(word)
is O(1).If you are fine with having it in memory, having it as
String
on which to callcontains(..)
is much less efficient than aHashSet
.And I have to mention another option - there's a data structure to represent dictionaries - it's called
Trie
. You can't find an implementation in the JDK though.A very rough calculation says that with all English words (1 million) you will need ~12 megabytes of RAM. which is a few times less than the default memory settings of the JVM. (1 million * 6 letters on average * 2 bytes per letter = 12 milion bytes, which is ~12 megabytes). (Well, perhaps a bit more to store hashes)
If you really insist on not reading it in memory, and you want to scan the file for a given word, so you can use a
java.util.Scanner
and itsscanner.findWithHorizon(..)
. But that would be inefficient - I assume O(n), and I/O overhead.虽然 HashSet 可能是一个完全可以接受的解决方案(请参阅 Bozho 的答案),但还有其他可以使用的数据结构,包括 Trie 或 Heap。
Trie 的优点是,根据实现细节,起始前缀字母可以共享(毕竟,trie 也称为“前缀树”)。根据实现结构和数据,这实际上可能是也可能不是改进。
另一种选择,特别是如果需要基于文件的访问,是使用 堆 -- Java 的 PriorityQueue 实际上是一个堆,但它不是基于文件的,因此这需要找到/制作一个实现。
所有这些数据结构(以及更多)都可以实现为基于文件(每次查找使用更多 IO - 这实际上可能不太全面 - 但可以节省内存)或直接实现(例如使用 SQLite 并让它做 B 树的事情)。 SQLite 的优点在于它可以成为工具箱中的“通用工具”(曾经常用;-);数据导入、检查和修改都很简单,而且“它就是有效”。 SQLite 甚至用于功能较弱的系统,例如 Android。
HashSet 随 Java “免费”提供,但没有标准的 Trie 或基于文件的 Heap 实现。我将从 HashSet - Reasoning:
快乐编码。
随机数据结构实现的链接(可能适合也可能不适合):
HashSet
的内存需求。 (Java 中的字符串使用 UTF-16 编码,每个字符最少为两个字节。)While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data-structures that can be used including a Trie or Heap.
The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.
Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.
All of these data-structures (and more) can be implemented to be file-based (use more IO per lookup -- which could actually be less overall -- but save memory) or implemented directly (e.g. use SQLite and let it do it's B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification is easy, and "it just works". SQLite is even used in less powerful systems such as Android.
HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:
Happy coding.
Links to random data-structure implementations (may or may not be suitable):
HashSet<String>
when using a Latin dictionary. (String in Java uses UTF-16 encoding which is minimum of two bytes/character.)您需要压缩数据以避免存储所有这些单词。这样做的方法是一棵树,其中节点是字母,叶子反映单词的结尾。这样,您就不会存储重复的数据,例如这些单词都具有相同前缀的
thetherethes
。有一种方法可以使该解决方案的内存效率更高。 (提示:字母顺序)
You need to compress your data to avoid having to store all those words. The way to do so would be a tree in which nodes are letters and leaves reflect the end of a word. This way you're not storing repetitive data such as
the there these
where those words all have the same prefix.There is a way to make this solution even more memory efficient. (Hint: letter order)
使用java.io.BufferedReader的readline()。返回一个字符串。
Use the readline() of java.io.BufferedReader. That returns a string.