如何将文件内容视为字符串

发布于 2024-11-09 10:06:22 字数 694 浏览 0 评论 0 原文

我正在创建一个使用字典的拼字游戏。为了提高效率,不是将整个字典(通过txt文件)加载到数据结构(集合、列表等)中,而是有任何内置的java类可以帮助我将文件的内容视为字符串。

具体来说,我想做的是通过执行诸如 fileName.contains (word) 之类的简单操作来检查游戏中创建的单词是否是字典中的有效单词,而不是使用内存效率低下的巨大列表并使用 list.contains (word )。

你们知道我能做什么吗?如果字典文件必须位于 txt 文件之外的其他文件中(例如 xml 文件),我也愿意尝试。

注意:我不是在寻找 http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29

这个方法不是 java API 的一部分。

HashSet 没有出现在我的脑海中,我一直坚持所有 contains () 方法都使用 O(n) 时间的想法,感谢 Bozho 清除了这个问题我,看起来我将使用 HashSet。

I am creating a Scrabble game that uses a dictionary. For efficiency, instead of loading the entire dictionary (via txt file) to a Data Structure (Set, List etc.) is there any built in java class that can help me treat the contents of the file as String.

Specifically what I want to do is check whether a word made in the game is a valid word of the dictionary by doing something simple like fileName.contains (word) instead of having a huge list that is memory inefficient and using list.contains (word).

Do you guys have any idea on what I may be able to do. If the dictionary file has to be in something other than a txt file (e.g. xml file), I am open to try that as well.

NOTE: I am not looking for http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29

This method is not a part of the java API.

HashSet didn't come to mind, I was stuck in the idea that all contains () methods used O(n) time, thanks to Bozho for clearing that with me, looks like I will be using a HashSet.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

三生池水覆流年 2024-11-16 10:06:22

我认为最好的选择是将它们全部加载到内存中的 HashSet 中。其中 contains(word) 的复杂度为 O(1)。

如果您愿意将其保留在内存中,那么将其作为 String 来调用 contains(..) 的效率比 HashSet 低得多>。

我不得不提另一个选择 - 有一个数据结构来表示字典 - 它被称为 Trie。但你在 JDK 中找不到实现。

一个非常粗略的计算表明,对于所有英语单词(100 万),您将需要约 12 MB 的 RAM。这比 JVM 的默认内存设置小几倍。 (100 万 * 平均 6 个字母 * 每个字母 2 字节 = 1200 万字节,即约 12 兆字节)。 (好吧,也许多一点来存储哈希值)

如果您真的坚持不在内存中读取它,并且您想扫描文件中的给定单词,那么您可以使用 java.util.Scanner 及其 scanner.findWithHorizo​​n (..)。但这是低效的——我假设 O(n) 和 I/O 开销。

I think your best option is to load them all in memory, in a HashSet. There contains(word) is O(1).

If you are fine with having it in memory, having it as String on which to call contains(..) is much less efficient than a HashSet.

And I have to mention another option - there's a data structure to represent dictionaries - it's called Trie. You can't find an implementation in the JDK though.

A very rough calculation says that with all English words (1 million) you will need ~12 megabytes of RAM. which is a few times less than the default memory settings of the JVM. (1 million * 6 letters on average * 2 bytes per letter = 12 milion bytes, which is ~12 megabytes). (Well, perhaps a bit more to store hashes)

If you really insist on not reading it in memory, and you want to scan the file for a given word, so you can use a java.util.Scanner and its scanner.findWithHorizon(..). But that would be inefficient - I assume O(n), and I/O overhead.

沫尐诺 2024-11-16 10:06:22

虽然 HashSet 可能是一个完全可以接受的解决方案(请参阅 Bozho 的答案),但还有其他可以使用的数据结构,包括 Trie 或 Heap。

Trie 的优点是,根据实现细节,起始前缀字母可以共享(毕竟,trie 也称为“前缀树”)。根据实现结构和数据,这实际上可能是也可能不是改进。

另一种选择,特别是如果需要基于文件的访问,是使用 -- Java 的 PriorityQueue 实际上是一个堆,但它不是基于文件的,因此这需要找到/制作一个实现。

所有这些数据结构(以及更多)都可以实现为基于文件(每次查找使用更多 IO - 这实际上可能不太全面 - 但可以节省内存)或直接实现(例如使用 SQLite 并让它做 B 树的事情)。 SQLite 的优点在于它可以成为工具箱中的“通用工具”(曾经常用;-);数据导入、检查和修改都很简单,而且“它就是有效”。 SQLite 甚至用于功能较弱的系统,例如 Android。

HashSet 随 Java “免费”提供,但没有标准的 Trie 或基于文件的 Heap 实现。我将从 HashSet - Reasoning:

  1. Dictionary = 5MB 开始。
  2. 加载到 HashSet(假设有大量开销)= 20MB。
  3. 与其他事物相关的内存使用量 = 最少(假设笔记本电脑/台式机)
  4. 使用 HashSet 实现的时间 = 2 分钟。
  5. 如果我认为 HashSet 不够好,我只会“损失”2 分钟:-)

快乐编码。


随机数据结构实现的链接(可能适合也可能不适合):

  • TernarySearchTrie 读取平面文件(必须专门构建?)
  • TrieTree 支持从平面文件创建 Trie 文件。不确定这个 Trie 是否可以从磁盘运行。
  • FileHash 使用文件支持的哈希。
  • HashStore 另一个基于磁盘的哈希
  • WB B-Tree 简单 B-tree 实现 /“数据库”
  • SQLite 小型嵌入式关系型数据库管理系统。
  • UTF8String 可以用于显着减少使用拉丁语字典时使用 HashSet 的内存需求。 (Java 中的字符串使用 UTF-16 编码,每个字符最少为两个字节。)

While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data-structures that can be used including a Trie or Heap.

The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.

Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.

All of these data-structures (and more) can be implemented to be file-based (use more IO per lookup -- which could actually be less overall -- but save memory) or implemented directly (e.g. use SQLite and let it do it's B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification is easy, and "it just works". SQLite is even used in less powerful systems such as Android.

HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:

  1. Dictionary = 5MB.
  2. Loaded into HashSet (assuming lots of overhead) = 20MB.
  3. Memory usage in relation to other things = Minimal (assumes laptop/desktop)
  4. Time to implement with HashSet = 2 Minutes.
  5. I will have only "lost" 2 Minutes if I decide a HashSet wasn't good enough :-)

Happy coding.


Links to random data-structure implementations (may or may not be suitable):

  • TernarySearchTrie Reads in a flat file (must be specially constructed?)
  • TrieTree Has support for creating the Trie file from a flat file. Not sure if this Trie works from disk.
  • FileHash Hash which uses a file backing.
  • HashStore Another disk-based hash
  • WB B-Tree Simple B-tree implementation / "database"
  • SQLite Small embedded RDBMS.
  • UTF8String Can be used to significantly reduce the memory requirements of using HashSet<String> when using a Latin dictionary. (String in Java uses UTF-16 encoding which is minimum of two bytes/character.)
梦归所梦 2024-11-16 10:06:22

您需要压缩数据以避免存储所有这些单词。这样做的方法是一棵树,其中节点是字母,叶子反映单词的结尾。这样,您就不会存储重复的数据,例如这些单词都具有相同前缀的thetherethes

有一种方法可以使该解决方案的内存效率更高。 (提示:字母顺序)

You need to compress your data to avoid having to store all those words. The way to do so would be a tree in which nodes are letters and leaves reflect the end of a word. This way you're not storing repetitive data such as the there these where those words all have the same prefix.

There is a way to make this solution even more memory efficient. (Hint: letter order)

天煞孤星 2024-11-16 10:06:22

使用java.io.BufferedReader的readline()。返回一个字符串。

String line = new BufferedReader (new FileReader (file) ).readline ();

Use the readline() of java.io.BufferedReader. That returns a string.

String line = new BufferedReader (new FileReader (file) ).readline ();
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文