Advice on random access to a large number of objects (such as a hash table)
I'm processing some generated data files (hundreds of Mbytes) which contain several G objects. I need random access to these objects. A possible implementation, I guess, might be a big HashTable. My program is written in Java, and it seems java.util.HashMap cannot handle this (somehow it's extremely slow). Could anyone recommend a solution for random access to these objects?
Comments (3)
If a HashMap is extremely slow, then the two most likely causes are as follows:

The hashCode() and/or equals(Object) methods on your key class could be very expensive. For instance, if you use an array or a collection as a key, the hashCode() method will access every element each time you call it, and the equals method will do the same for equal keys.

Your key class could have a poor hashCode() method that is giving the same value for a significant percentage of the (distinct) keys used by the program. When this occurs you get many key collisions, and that can be really bad for performance when the hash table gets large.

I suggest you look at these possibilities first ... before changing your data structure.
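For instance, here is a minimal sketch (the ArrayKey class is invented for this illustration) of the usual fix when an array is the natural key: wrap it in a small class that delegates to Arrays.hashCode/Arrays.equals and caches the hash, so hashCode() does not walk the array on every lookup:

    import java.util.Arrays;

    // Illustrative wrapper key; computes the hash once and reuses it.
    final class ArrayKey {
        private final int[] values;
        private final int hash;

        ArrayKey(int[] values) {
            this.values = values.clone();
            this.hash = Arrays.hashCode(this.values);   // O(n) once, at construction
        }

        @Override
        public int hashCode() {
            return hash;                                // O(1) on every lookup
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof ArrayKey
                    && Arrays.equals(values, ((ArrayKey) o).values);
        }
    }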
Note: if "several G objects" means several billion objects, then you'll have trouble holding the files' contents in memory ... unless you are running this application on a machine with hundreds of gigabytes of RAM. I advise you to do some "back of the envelope" calculations to see if what you are trying to do is feasible.
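As a rough illustration of such a calculation (the per-entry figures here are typical 64-bit JVM estimates, not measurements): 2 billion entries times roughly 48 bytes of HashMap bookkeeping per entry (the internal node, the table slot, and object headers) is already about 96 GB, before the keys and values themselves are counted.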
Whatever your keys are, make sure you're generating a good hash for each one via hashCode(). A lot of the time, bad HashMap performance can be blamed on colliding hashes. When there's a collision, HashMap generates a linked list for the colliding objects. Worst case, if you're returning the same hash for all objects, HashMap essentially becomes a linked list. Here's a good starting place for writing hash functions: http://www.javamex.com/tutorials/collections/hash_function_guidelines.shtml
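As a hedged illustration of what such guidelines boil down to (the Point class here is made up for the example), combine each significant field with a prime multiplier so that distinct keys spread across the table; java.util.Objects.hash(x, y) performs the same combination:

    // Hypothetical key class showing the common 31-multiplier pattern.
    final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public int hashCode() {
            int result = 31 + x;            // mix in each significant field
            result = 31 * result + y;
            return result;                  // same result as Objects.hash(x, y)
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point
                    && x == ((Point) o).x
                    && y == ((Point) o).y;
        }
    }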
A few hundred MB cannot hold several billion objects unless each object is a bit (which is not really an object IMHO).

How I would approach this is to use a memory-mapped file to map in the contents of the data, and to build your own hash table in another memory-mapped file (which requires you to scan the data once to build the keys).

Depending on the layout of the data, it is worth remembering that random access is not the most efficient way to use the cache, i.e. your cache loads lines of 64 bytes (depending on the architecture), and if your structure doesn't fit in memory, record-based tables may be more efficient.
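A minimal sketch of the memory-mapping part, assuming the data file fits in a single mapping (one map() call is limited to 2 GB, so larger files need several MappedByteBuffer windows) and assuming fixed-size records; the file name, record size, and record index are placeholders, and the hash table built on top of the mapping is left out:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedRecords {
        public static void main(String[] args) throws IOException {
            Path file = Paths.get("data.bin");          // placeholder file name
            int recordSize = 64;                        // assumed fixed record size
            long recordIndex = 12345;                   // example record to fetch

            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                // Map the whole file read-only into memory.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

                // Random access by byte offset: position and read one record.
                byte[] record = new byte[recordSize];
                buffer.position((int) (recordIndex * recordSize));
                buffer.get(record);
            }
        }
    }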