当前位置：文江博客话题详情

如何将文件内容视为字符串

发布于 2024-11-09 10:06:22 字数 694 浏览 0 评论 0 原文

我正在创建一个使用字典的拼字游戏。为了提高效率，不是将整个字典（通过txt文件）加载到数据结构（集合、列表等）中，而是有任何内置的java类可以帮助我将文件的内容视为字符串。

具体来说，我想做的是通过执行诸如 fileName.contains (word) 之类的简单操作来检查游戏中创建的单词是否是字典中的有效单词，而不是使用内存效率低下的巨大列表并使用 list.contains (word ）。

你们知道我能做什么吗？如果字典文件必须位于 txt 文件之外的其他文件中（例如 xml 文件），我也愿意尝试。

注意：我不是在寻找 http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29

这个方法不是 java API 的一部分。

HashSet 没有出现在我的脑海中，我一直坚持所有 contains () 方法都使用 O(n) 时间的想法，感谢 Bozho 清除了这个问题我，看起来我将使用 HashSet。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

三生池水覆流年 2024-11-16 10:06:22

我认为最好的选择是将它们全部加载到内存中的 HashSet 中。其中 contains(word) 的复杂度为 O(1)。

如果您愿意将其保留在内存中，那么将其作为 String 来调用 contains(..) 的效率比 HashSet 低得多>。

我不得不提另一个选择 - 有一个数据结构来表示字典 - 它被称为 Trie。但你在 JDK 中找不到实现。

一个非常粗略的计算表明，对于所有英语单词（100 万），您将需要约 12 MB 的 RAM。这比 JVM 的默认内存设置小几倍。（100 万 * 平均 6 个字母 * 每个字母 2 字节 = 1200 万字节，即约 12 兆字节）。（好吧，也许多一点来存储哈希值）

如果您真的坚持不在内存中读取它，并且您想扫描文件中的给定单词，那么您可以使用 java.util.Scanner 及其 scanner.findWithHorizon (..)。但这是低效的——我假设 O(n) 和 I/O 开销。

回复收藏 0 原文

沫尐诺 2024-11-16 10:06:22

虽然 HashSet 可能是一个完全可以接受的解决方案（请参阅 Bozho 的答案），但还有其他可以使用的数据结构，包括 Trie 或 Heap。

Trie 的优点是，根据实现细节，起始前缀字母可以共享（毕竟，trie 也称为“前缀树”）。根据实现结构和数据，这实际上可能是也可能不是改进。

另一种选择，特别是如果需要基于文件的访问，是使用堆 -- Java 的 PriorityQueue 实际上是一个堆，但它不是基于文件的，因此这需要找到/制作一个实现。

所有这些数据结构（以及更多）都可以实现为基于文件（每次查找使用更多 IO - 这实际上可能不太全面 - 但可以节省内存）或直接实现（例如使用 SQLite 并让它做 B 树的事情）。 SQLite 的优点在于它可以成为工具箱中的“通用工具”（曾经常用;-）；数据导入、检查和修改都很简单，而且“它就是有效”。 SQLite 甚至用于功能较弱的系统，例如 Android。

HashSet 随 Java “免费”提供，但没有标准的 Trie 或基于文件的 Heap 实现。我将从 HashSet - Reasoning：

Dictionary = 5MB 开始。
加载到 HashSet（假设有大量开销）= 20MB。
与其他事物相关的内存使用量 = 最少（假设笔记本电脑/台式机）
使用 HashSet 实现的时间 = 2 分钟。
如果我认为 HashSet 不够好，我只会“损失”2 分钟:-)

快乐编码。

随机数据结构实现的链接（可能适合也可能不适合）：

TernarySearchTrie 读取平面文件（必须专门构建？）
TrieTree 支持从平面文件创建 Trie 文件。不确定这个 Trie 是否可以从磁盘运行。
FileHash 使用文件支持的哈希。
HashStore 另一个基于磁盘的哈希
WB B-Tree 简单 B-tree 实现 /“数据库”
SQLite 小型嵌入式关系型数据库管理系统。
UTF8String 可以用于显着减少使用拉丁语字典时使用 HashSet 的内存需求。（Java 中的字符串使用 UTF-16 编码，每个字符最少为两个字节。）

While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data-structures that can be used including a Trie or Heap.

The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.

Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.

All of these data-structures (and more) can be implemented to be file-based (use more IO per lookup -- which could actually be less overall -- but save memory) or implemented directly (e.g. use SQLite and let it do it's B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification is easy, and "it just works". SQLite is even used in less powerful systems such as Android.

HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:

Dictionary = 5MB.
Loaded into HashSet (assuming lots of overhead) = 20MB.
Memory usage in relation to other things = Minimal (assumes laptop/desktop)
Time to implement with HashSet = 2 Minutes.
I will have only "lost" 2 Minutes if I decide a HashSet wasn't good enough :-)

Happy coding.

Links to random data-structure implementations (may or may not be suitable):

TernarySearchTrie Reads in a flat file (must be specially constructed?)
TrieTree Has support for creating the Trie file from a flat file. Not sure if this Trie works from disk.
FileHash Hash which uses a file backing.
HashStore Another disk-based hash
WB B-Tree Simple B-tree implementation / "database"
SQLite Small embedded RDBMS.
UTF8String Can be used to significantly reduce the memory requirements of using HashSet<String> when using a Latin dictionary. (String in Java uses UTF-16 encoding which is minimum of two bytes/character.)

回复收藏 0 原文

梦归所梦 2024-11-16 10:06:22

您需要压缩数据以避免存储所有这些单词。这样做的方法是一棵树，其中节点是字母，叶子反映单词的结尾。这样，您就不会存储重复的数据，例如这些单词都具有相同前缀的thetherethes。

有一种方法可以使该解决方案的内存效率更高。（提示：字母顺序）

回复收藏 0 原文

天煞孤星 2024-11-16 10:06:22

使用java.io.BufferedReader的readline()。返回一个字符串。

String line = new BufferedReader (new FileReader (file) ).readline ();

Use the readline() of java.io.BufferedReader. That returns a string.

String line = new BufferedReader (new FileReader (file) ).readline ();

回复收藏 0 原文

~没有更多了~

关于作者

泪之魂

暂无简介

0 文章

0 评论

23 人气

关注发私信

友情链接

文江博客

如何将文件内容视为字符串

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（4）

关于作者

相关话题

热门标签

推荐作者

lorenzathorton8

Zero

萧瑟寒风

mylayout

tkewei

17818769742

友情链接

如何将文件内容视为字符串

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（4）

关于作者

相关话题

热门标签

推荐作者

lorenzathorton8

Zero

萧瑟寒风

mylayout

tkewei

17818769742

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。