java:非常大的树?
目标是建造非常大的树。 我所说的“非常大”是指数亿个节点,相当于几千兆字节。
问题是通用数据结构的开销太大。我买不起“节点”对象和子“地图”。我需要以非常紧凑的方式将其直接编码到内存中。
因此,我想知道是否存在一些以整数作为键和值的内存有效实现树,而不在内部使用对象,因此需要(4字节键+4字节值+4字节子索引+一些免费字节哈希空间 = 每个条目平均 15 个字节),这将允许我使用外部映射 int<->keys 和 int<->values 来搜索树。
有人吗?
PS:内部使用对象至少使用5倍的空间:8个引用+4个额外哈希空间+16个对象头+8个键引用+8个值引用+8个父引用+8个子引用+子映射obj的(16+x) = 每个条目近 76+x 字节。 (例如,我们的默认实现每个条目需要大约 100 个字节)
The objective is to build very large trees.
By very large I mean hundreds of millions of nodes, fitting in a few gigabytes.
The issue is that common data structures have way too much overhead. I cannot afford to have "node" objects and children "maps". I need to directly encode it into memory in a very compact way.
Therefore, I was wondering if there existed some memory efficient implementation of trees having integers as key and values, without using objects internally, therefore needing (4 byte for key + 4 bytes for value + 4 bytes for children index + a few bytes for free hashing space = 15 bytes per entry on average) which would allow me to use an external mapping int<->keys and int<->values to search the tree.
Anyone?
PS: Using objects internally uses at least 5 times more space: 8 reference + 4 extra hash space + 16 object header + 8 key ref + 8 value ref + 8 parent ref + 8 children ref + (16 + x) for children map obj = nearly 76+x bytes per entry. (for instance, our default implementation needed around 100 bytes per entry)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
这实际上不是 Java 特定的问题,而是一个普遍的概念。
试试这个: http://webdocs.cs.ualberta.ca /~holte/T26/tree-as-array.html
关键是使用基元数组,以避免对象开销。
That's actually not a Java specific question but more of a general concept.
Try this: http://webdocs.cs.ualberta.ca/~holte/T26/tree-as-array.html
The key would be to use arrays of primitives, in order to avoid object overhead.
我不知道有任何特定的树实现可以完全做到这一点,但 VTD-XML 在内部使用带有指向缓冲区的指针的标记数组来表示 XML 树(DOM)。也许您可以从他们的解决方案中获得启发?
I don't know of any specific tree implementation that does exactly that, but VTD-XML represents an XML tree (the DOM) internally using an array of tokens with pointers into a buffer. Perhaps you can get inspired by their solution?
您可能会发现这个库可以帮助您实现您想要的 - 它是专门为将值存储为基元而设计的,并且在幕后进行了一些字节码黑客操作以给出存储对象的错觉。当...时使用它
它没有特定的 Tree 集合,但它可能会起作用,具体取决于您的需要。
http://code.google.com/p/vanilla-java/wiki/HugeCollections
You may find that this library helps you achieve what you want - it is specifically designed for storing values as primitives and does some bytecode hackery behind the scenes to give the illusion of storing Objects. Use it when...
It doesn't have a specific Tree collection, but it might do the trick, depending on what you need.
http://code.google.com/p/vanilla-java/wiki/HugeCollections
我认为您不会找到任何已经为您实现的内容,但是您所描述的内容可以使用 IntBuffer。您将创建一个“包装器”类,将索引转换为缓冲区中的记录,并根据需要实例化/丢弃这些包装器(即,当您遍历树时,您可能不关心关于保留对父级的引用)。
有几个问题:
TreeMap
等类为您完成了。但算法非常简单,可以从 Wikipedia 获取。I don't think you'll find anything already implemented for you, but what you've described could be very easily implemented using an IntBuffer. You'd create a "wrapper" class that converts indexes into records in the buffer, and instantiate/discard those wrappers as needed (ie, as you're traversing a tree, you probably don't care about holding a reference to the parent).
There are a couple of concerns:
TreeMap
. But the algorithms are pretty simple, and available from Wikipedia.每个节点都可以拥有对其父节点的引用,而不是保留子节点列表。因此,序列化节点不需要三个以上的整数值(父节点、键、值)。
这种方法的一个问题是树遍历。获得所有节点子节点的明确列表需要迭代所有节点。如果三元组按其父值排序,则可以改进遍历。再添加一个整数值,即指向下一个键的指针,将允许将节点保留在链表中,从而简化节点插入和删除的工作。
Instead of keeping a list of children, every node could have a reference to its parent. Thus serializing a node wouldn't need more than three integer values (parent, key, value).
A problem with this approach is tree traversal. Obtaining a definite list of all node children would require iterating through all nodes. If the tripplettes are sorted by their parent values traversal may be improved. Adding one more integer value, i.e. a pointer to the next key would allow keeping the nodes in a linked list easing the job of insertion and removal of nodes.