java:非常大的树?

发布于 2024-12-02 15:05:49 字数 408 浏览 0 评论 0原文

目标是建造非常大的树。 我所说的“非常大”是指数亿个节点,相当于几千兆字节。

问题是通用数据结构的开销太大。我买不起“节点”对象和子“地图”。我需要以非常紧凑的方式将其直接编码到内存中。

因此,我想知道是否存在一些以整数作为键和值的内存有效实现树,而不在内部使用对象,因此需要(4字节键+4字节值+4字节子索引+一些免费字节哈希空间 = 每个条目平均 15 个字节),这将允许我使用外部映射 int<->keys 和 int<->values 来搜索树。

有人吗?

PS:内部使用对象至少使用5倍的空间:8个引用+4个额外哈希空间+16个对象头+8个键引用+8个值引用+8个父引用+8个子引用+子映射obj的(16+x) = 每个条目近 76+x 字节。 (例如,我们的默认实现每个条目需要大约 100 个字节)

The objective is to build very large trees.
By very large I mean hundreds of millions of nodes, fitting in a few gigabytes.

The issue is that common data structures have way too much overhead. I cannot afford to have "node" objects and children "maps". I need to directly encode it into memory in a very compact way.

Therefore, I was wondering if there existed some memory efficient implementation of trees having integers as key and values, without using objects internally, therefore needing (4 byte for key + 4 bytes for value + 4 bytes for children index + a few bytes for free hashing space = 15 bytes per entry on average) which would allow me to use an external mapping int<->keys and int<->values to search the tree.

Anyone?

PS: Using objects internally uses at least 5 times more space: 8 reference + 4 extra hash space + 16 object header + 8 key ref + 8 value ref + 8 parent ref + 8 children ref + (16 + x) for children map obj = nearly 76+x bytes per entry. (for instance, our default implementation needed around 100 bytes per entry)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

等待我真够勒 2024-12-09 15:05:49

这实际上不是 Java 特定的问题,而是一个普遍的概念。

试试这个: http://webdocs.cs.ualberta.ca /~holte/T26/tree-as-array.html

关键是使用基元数组,以避免对象开销。

That's actually not a Java specific question but more of a general concept.

Try this: http://webdocs.cs.ualberta.ca/~holte/T26/tree-as-array.html

The key would be to use arrays of primitives, in order to avoid object overhead.

緦唸λ蓇 2024-12-09 15:05:49

我不知道有任何特定的树实现可以完全做到这一点,但 VTD-XML 在内部使用带有指向缓冲区的指针的标记数组来表示 XML 树(DOM)。也许您可以从他们的解决方案中获得启发?

I don't know of any specific tree implementation that does exactly that, but VTD-XML represents an XML tree (the DOM) internally using an array of tokens with pointers into a buffer. Perhaps you can get inspired by their solution?

时光礼记 2024-12-09 15:05:49

您可能会发现这个库可以帮助您实现您想要的 - 它是专门为将值存储为基元而设计的,并且在幕后进行了一些字节码黑客操作以给出存储对象的错觉。当...时使用它

...您希望在内存中有效地存储大量数据。该库可以显着减少 Full GC 时间并减少内存消耗。

它没有特定的 Tree 集合,但它可能会起作用,具体取决于您的需要。

http://code.google.com/p/vanilla-java/wiki/HugeCollections

You may find that this library helps you achieve what you want - it is specifically designed for storing values as primitives and does some bytecode hackery behind the scenes to give the illusion of storing Objects. Use it when...

... you want to efficiently store large collections of data in memory. This library can dramatically reduce Full GC times and reduce memory consumption as well.

It doesn't have a specific Tree collection, but it might do the trick, depending on what you need.

http://code.google.com/p/vanilla-java/wiki/HugeCollections

孤独岁月 2024-12-09 15:05:49

我认为您不会找到任何已经为您实现的内容,但是您所描述的内容可以使用 IntBuffer。您将创建一个“包装器”类,将索引转换为缓冲区中的记录,并根据需要实例化/丢弃这些包装器(即,当您遍历树时,您可能不关心关于保留对父级的引用)。

有几个问题:

  • 包装器实例的垃圾收集:只要它们是短暂的,它们就永远不会离开 Eden,因此 GC 几乎是免费的。
  • 缓冲区内记录的垃圾收集:如果您有一个一次性写入树,那么这没有问题。否则,您需要维护一个空闲列表。不难,但需要一些时间。
  • 实现树的一般机制:这已经通过 TreeMap 等类为您完成了。但算法非常简单,可以从 Wikipedia 获取。

I don't think you'll find anything already implemented for you, but what you've described could be very easily implemented using an IntBuffer. You'd create a "wrapper" class that converts indexes into records in the buffer, and instantiate/discard those wrappers as needed (ie, as you're traversing a tree, you probably don't care about holding a reference to the parent).

There are a couple of concerns:

  • Garbage collection of the wrapper instances: as long as they're short-lived, they never get out of Eden, so GC is almost free.
  • Garbage collection of records within the buffer: if you have a write-once tree, this is no problem. Otherwise, you'll need to maintain a free list. Not difficult, but it takes some time.
  • General mechanics of implementing a tree: this is already done for you with classes like TreeMap. But the algorithms are pretty simple, and available from Wikipedia.
空心空情空意 2024-12-09 15:05:49

每个节点都可以拥有对其父节点的引用,而不是保留子节点列表。因此,序列化节点不需要三个以上的整数值(父节点、键、值)。

这种方法的一个问题是树遍历。获得所有节点子节点的明确列表需要迭代所有节点。如果三元组按其父值排序,则可以改进遍历。再添加一个整数值,即指向下一个键的指针,将允许将节点保留在链表中,从而简化节点插入和删除的工作。

Instead of keeping a list of children, every node could have a reference to its parent. Thus serializing a node wouldn't need more than three integer values (parent, key, value).

A problem with this approach is tree traversal. Obtaining a definite list of all node children would require iterating through all nodes. If the tripplettes are sorted by their parent values traversal may be improved. Adding one more integer value, i.e. a pointer to the next key would allow keeping the nodes in a linked list easing the job of insertion and removal of nodes.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文