HashMap alternatives for memory-efficient data storage

Published 2024-09-28 15:14:12

I've currently got a spreadsheet-type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to be about 5x the memory used by the data itself.

This question asks about efficient collections libraries, and the answer was use Google Collections. My follow up is "which part?". I've been reading through the documentation but don't feel like it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions).

So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.

  • My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings
  • Some columns will have a lot of repeated values
  • Primary operations are updating or removing records based on the values of certain fields, and adding/removing/combining columns

I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.

EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.

Comments (11)

停顿的约定 2024-10-05 15:14:12

"Some columns will have a lot of repeated values"

immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
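As a rough sketch of that idea (the class below is illustrative, not from any library), a Flyweight-style pool canonicalizes repeated cell values so equal strings are stored once and shared by reference:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal Flyweight-style pool: equal values are stored once and shared. */
class ValuePool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the canonical shared instance for the given value. */
    String canonicalize(String value) {
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    /** Number of distinct values actually stored. */
    int uniqueCount() {
        return pool.size();
    }
}
```

A column with a million rows but only fifty distinct values then holds references to just fifty String instances; String.intern() achieves something similar, but uses the JVM-wide pool.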

爱*していゐ 2024-10-05 15:14:12

Trove collections take particular care about the space they occupy (they also have data structures tailored to primitive types, if you can stick to those); take a look here.

Otherwise you can try the Apache Commons collections; just run your own benchmarks!

In any case, if you have many references to the same elements, try applying a suitable pattern (like Flyweight).

盛夏已如深秋| 2024-10-05 15:14:12

Chronicle Map could have overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).

Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store object headers, which are not accounted for in the HashMap overhead figures above. Chronicle Map integrates with Chronicle-Values, a library for generating flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.

穿透光 2024-10-05 15:14:12

So I'm assuming that you have a Map<ColumnName, Column>, where the column is actually something like an ArrayList<Object>.

A few possibilities -

  • Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.

  • You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load-factor combinations that may help. I've messed around with this in the past; you might get a 30% reduction in memory if you're lucky.

  • What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?
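The last bullet could be sketched like this (a hypothetical wrapper; class and method names are purely illustrative): one shared Map from column key to array index, with each row stored as a plain Object[] instead of its own HashMap:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One Object[] per row plus a single shared column index, instead of a HashMap per row. */
class SimpleSheet {
    private final Map<String, Integer> columnIndex = new LinkedHashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    SimpleSheet(String... columns) {
        for (String c : columns) {
            columnIndex.put(c, columnIndex.size());
        }
    }

    void addRow(Object... values) {
        Object[] row = new Object[columnIndex.size()];
        System.arraycopy(values, 0, row, 0, values.length);
        rows.add(row);
    }

    Object get(int rowIndex, String column) {
        return rows.get(rowIndex)[columnIndex.get(column)];
    }
}
```

The column index is paid for once, not once per row, which is where the per-row HashMap overhead goes away.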

哭泣的笑容 2024-10-05 15:14:12

Assuming all your rows have most of the same columns, you can just use an array for each row, and a Map<ColumnKey, Integer> to lookup which columns refers to which cell. This way you have only 4-8 bytes of overhead per cell.

If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.

EDIT: You can structure your data as either row based or column based. If it's row based (one array of cells per row), adding/removing a row is just a matter of removing that row. If it's column based, you can have an array per column. This can make handling primitive types much more efficient, i.e. you can have one column which is int[] and another which is double[]; it's much more common for an entire column to have the same data type than for a whole row.

However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other type will result in a rebuild of the entire dataset.

(Something I do is have row-based data and add columns to the end, assuming that if a row isn't long enough, the column has a default value; this avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
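The column-based point about primitive types could look roughly like this growable int column (an illustrative sketch, not a library class); it stores no boxed Integer objects at all:

```java
import java.util.Arrays;

/** Column-oriented storage: one primitive array per column avoids boxing entirely. */
class IntColumn {
    private int[] data = new int[16];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2);  // grow geometrically, like ArrayList
        }
        data[size++] = value;
    }

    int get(int index) {
        return data[index];
    }

    int size() {
        return size;
    }
}
```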

梦幻的味道 2024-10-05 15:14:12

Guava does include a Table interface and a hash-based implementation. Seems like a natural fit for your problem. Note that this is still marked as beta.

再见回来 2024-10-05 15:14:12

"keeps its data in an ArrayList of HashMaps"

Well, this part seems terribly inefficient to me. An empty HashMap will already allocate 16 * (pointer size) bytes (16 being the default initial capacity), plus some variables for the hash object (14 + psize). If you have a lot of sparsely filled rows, this could be a big problem.

One option would be to use a single large hash with a composite key (combining row and column). That doesn't make operations on whole rows very efficient, though.
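A composite key along those lines might be sketched as follows (a hypothetical class, not from any library):

```java
import java.util.Objects;

/** Key for a single large Map<CellKey, Object>, combining row and column. */
class CellKey {
    final int row;
    final String column;

    CellKey(int row, String column) {
        this.row = row;
        this.column = column;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CellKey)) return false;
        CellKey other = (CellKey) o;
        return row == other.row && column.equals(other.column);
    }

    @Override
    public int hashCode() {
        return Objects.hash(row, column);
    }
}
```

Cells then live in one Map<CellKey, Object>, so the per-map overhead is paid once rather than once per row.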

Also, since you don't mention the operation of adding cells, you can create the hashes with only the necessary inner storage (the initialCapacity parameter).

I don't know much about Google Collections, so I can't help there. Also, if you find any useful optimization, please do post it here! It would be interesting to know.

天气好吗我好吗 2024-10-05 15:14:12

I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.

It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.

半步萧音过轻尘 2024-10-05 15:14:12

For me, Apache Commons collections did not save any space. Here are two similar heap dumps taken just before an OOME, comparing a Java 11 HashMap to an Apache Commons HashedMap:

[heap dump screenshots: Java 11 HashMap vs. Apache Commons HashedMap]

The Apache Commons HashedMap doesn't appear to make any meaningful difference.

乖不如嘢 2024-10-05 15:14:12

From your description, it seems that instead of an ArrayList of HashMaps you rather want a (Linked)HashMap of ArrayList (each ArrayList would be a column).

I'd add a double map from field-name to column-number, and some clever getters/setters that never throw IndexOutOfBoundsException.

You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
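A minimal sketch of that column-major layout, including getters/setters that never throw IndexOutOfBoundsException (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Column-major layout: one ArrayList per column, keyed by field name. */
class ColumnStore {
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();

    void addColumn(String name) {
        columns.put(name, new ArrayList<>());
    }

    /** Returns null instead of throwing when the cell doesn't exist yet. */
    Object get(String column, int row) {
        List<Object> col = columns.get(column);
        return (col == null || row >= col.size()) ? null : col.get(row);
    }

    void set(String column, int row, Object value) {
        List<Object> col = columns.get(column);
        while (col.size() <= row) {
            col.add(null);  // pad shorter columns with a default value
        }
        col.set(row, value);
    }
}
```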

"Some columns will have a lot of repeated values"

I doubt this matters much, especially if they are Strings (which can be interned), since your collection would only store references to them.

老旧海报 2024-10-05 15:14:12

Why don't you try using a cache implementation like EHCache?
This turned out to be very effective for me when I hit the same situation.
You can simply store your collection within the EHCache implementation.
There are configurations like:

Maximum bytes to be used from local heap.

Once the bytes used by your application exceed the amount configured for the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a Least Recently Used algorithm.
You can be sure of avoiding out-of-memory errors by using this type of cache implementation.
It only increases the IO operations of your application by a small degree.
This is just a bird's-eye view of the configuration. There are many configuration options to tune for your requirements.
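As a rough sketch of the kind of configuration described (assuming Ehcache 2.x; the cache name and sizes are purely illustrative):

```xml
<ehcache>
  <!-- Spill to disk once the heap byte budget is exceeded; evict by LRU. -->
  <cache name="sheetData"
         maxBytesLocalHeap="256M"
         maxBytesLocalDisk="2G"
         memoryStoreEvictionPolicy="LRU">
    <persistence strategy="localTempSwap"/>
  </cache>
</ehcache>
```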
