Reducing the memory usage of a very large HashMap

Posted 2024-11-24 20:02:12

I have a very large hash map (2+ million entries) that is created by reading in the contents of a CSV file. Some information:

  1. The HashMap maps a String key (less than 20 characters) to a String value (approximately 50 characters).
  2. The HashMap is initialized with an initial capacity of 3 million, so that the load factor stays around 0.66.
  3. The HashMap is only used by a single operation, and once that operation completes, I clear() it. (It doesn't appear that clear() actually frees the memory; is a separate call to System.gc() necessary?)

One idea I had was to change the HashMap<String, String> to a HashMap<Integer, String> and use the hashCode of the String as the key. That would save a bit of memory but risks collisions if two strings have identical hash codes ... how likely is that for strings that are less than 20 characters long?

Does anyone else have any ideas on what to do here? The CSV file itself is only 100 MB, but Java ends up using over 600 MB of memory for this HashMap.

Thanks!

Comments (4)

友谊不毕业 2024-12-01 20:02:12

It sounds like you have the framework to try this already. Instead of adding the string, add the string.hashCode() and see if you get collisions.

In terms of freeing up memory, the JVM generally doesn't shrink its heap, but it will garbage-collect when it needs the space.

Also, it sounds like you might have an algorithm that doesn't need the hash table at all. Could you describe what you're trying to do in a little more detail?
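
A minimal sketch of that experiment, assuming rows of the form key,value (the file name data.csv, the class name CollisionCheck, and the comma parsing are placeholders, not anything from the thread). For scale: by the birthday approximation, ~2 million random 32-bit hash codes should produce on the order of a few hundred collisions, so the check is worth running before switching to Integer keys:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class CollisionCheck {
        public static void main(String[] args) throws IOException {
            // Map each hashCode back to the key that produced it, so a second
            // distinct key arriving at the same code is a genuine collision.
            Map<Integer, String> seen = new HashMap<>(3_000_000);
            long collisions = 0;
            // Loads the whole file into memory; fine for a 100 MB input.
            for (String line : Files.readAllLines(Paths.get("data.csv"))) {
                String key = line.substring(0, line.indexOf(','));  // assumes "key,value" rows
                String prev = seen.putIfAbsent(key.hashCode(), key);
                if (prev != null && !prev.equals(key)) {
                    collisions++;
                    System.out.println("Collision: \"" + prev + "\" vs \"" + key + "\"");
                }
            }
            System.out.println("Total collisions among distinct keys: " + collisions);
        }
    }

Note that storing only the hash code means a collision silently merges two keys, so any nonzero count rules the idea out unless the operation can tolerate that.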

伴梦长久 2024-12-01 20:02:12

Parse the CSV, and build a Map whose keys are your existing keys, but whose values are integer offsets pointing to each key's location in the file.

When you want the value for a key, find the index in the map, then use a RandomAccessFile to read that line from the file. Keep the RandomAccessFile open during processing, then close it when done.
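
A rough sketch of this index-then-seek approach, assuming one key,value record per line and a single-byte text encoding (CsvOffsetIndex and lookup are illustrative names; RandomAccessFile.readLine() does not handle multi-byte encodings). It keeps one long offset per key instead of the ~50-character value:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class CsvOffsetIndex implements AutoCloseable {
        private final RandomAccessFile raf;
        private final Map<String, Long> offsets = new HashMap<>(3_000_000);

        public CsvOffsetIndex(String path) throws IOException {
            raf = new RandomAccessFile(path, "r");
            long pos = raf.getFilePointer();
            String line;
            // Index pass: record where each row starts instead of keeping its value.
            while ((line = raf.readLine()) != null) {
                offsets.put(line.substring(0, line.indexOf(',')), pos);
                pos = raf.getFilePointer();
            }
        }

        public String lookup(String key) throws IOException {
            Long pos = offsets.get(key);
            if (pos == null) {
                return null;
            }
            raf.seek(pos);                 // jump to the row's first byte
            String line = raf.readLine();  // re-read the row on demand
            return line.substring(line.indexOf(',') + 1);
        }

        @Override
        public void close() throws IOException {
            raf.close();
        }
    }

The unbuffered readLine() makes the indexing pass slow; a buffered reader that counts bytes would speed it up, at the cost of a longer sketch.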

墨落画卷 2024-12-01 20:02:12

What you are trying to do is exactly a JOIN operation. Consider an in-memory DB like H2: load both CSV files into temp tables and then do a JOIN over them.
In my experience H2 handles load operations well, and this will certainly be faster and less memory-intensive than your manual HashMap-based joining method.
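
A hedged sketch of that approach using H2's built-in CSVREAD (the file names, table names, and columns K and V are placeholders for whatever the real CSV headers are; requires the com.h2database:h2 jar on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class H2CsvJoin {
        public static void main(String[] args) throws SQLException {
            // Purely in-memory database; it disappears when the connection closes.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:join");
                 Statement st = conn.createStatement()) {
                // H2's CSVREAD exposes a CSV file as a table source; column
                // names K and V are assumed to come from the CSV header lines.
                st.execute("CREATE TABLE lhs AS SELECT * FROM CSVREAD('left.csv')");
                st.execute("CREATE TABLE rhs AS SELECT * FROM CSVREAD('right.csv')");
                st.execute("CREATE INDEX idx_rhs_k ON rhs(K)");
                try (ResultSet rs = st.executeQuery(
                        "SELECT lhs.K, rhs.V FROM lhs JOIN rhs ON lhs.K = rhs.K")) {
                    while (rs.next()) {
                        process(rs.getString(1), rs.getString(2));
                    }
                }
            }
        }

        private static void process(String key, String value) {
            // placeholder for whatever the single operation does with each pair
        }
    }

Creating an index on the join column before querying keeps the join from degenerating into a full table scan per row.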

黄昏下泛黄的笔记 2024-12-01 20:02:12

If performance isn't the primary concern, store the entries in a database instead. Memory then stops being an issue, and you get good, if not great, search speed thanks to the database.
