哈希码。如何使用

发布于 2024-09-07 22:49:23 字数 491 浏览 6 评论 0 原文

虽然我非常了解 HashCode 是什么以及哈希表的作用,但我不得不承认我不知道如何使用它(超出了普通字典的范围)。我想实现自己的哈希表,所以首先我想了解有关哈希的基本知识:

  • 我知道我可以使用 getHashCode()/hashCode() 获取哈希码在 Java 和 Scala 中。这个数字是如何确定的。 (只是出于好奇)
  • 如果我知道一个对象的HashCode,我该如何访问它?也就是说,我该如何调用那个内存桶呢?
  • 我可以更改/设置变量HashCode吗?

现在,我有一个非常大(大约 10^9)的 Int 列表。我将访问其中的一些(从无到全部),并且我需要以尽可能最快的方式完成它。哈希表是最好的方法吗?

PS:我不想讨论它,我只是想知道哈希表是否知道是最有效的。如果还有其他好的方法也许你可以指点我。

谢谢,

Although I understand very well what HashCode is and what a Hash Table does, I have to admit I don´t know how to use it (Beyond a common dictionary). I wanted to implement my own Hash Table so first I want to know the very basic about Hash:

  • I know I can get the hash code with getHashCode()/hashCode() in Java and Scala. How is this number determined. (Just out of curiosity)
  • If I know the HashCode of an object, how can I access it? That is, how can I call that memory bucket?
  • Can I change/set a variables HashCode?

Now, I have a very big (about 10^9) list of Int. I am going to access some of them (from none to all) and I need to do it in the fastest way possible. Is a hash table THE BEST way to do it?

PS: I don't want to discuss it, I just want to know if the HashTable is know to be the most efficient. If there exist other good methods maybe you can point me to them.

Thanks,

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

乜一 2024-09-14 22:49:23

哈希码只是一个数字,保证每种类型的对象与原始对象“相同”。

这意味着为每个哈希码调用返回“0”将是有效的,但会弄巧成拙。关键是可以(并且在大多数情况下)存在重复项。

如果您知道对象的哈希码,则不一定可以访问它。根据上面的示例,如果所有对象都返回“0”,您仍然无法询问哪个对象的哈希代码为 0。但是,您可以询问哈希代码为 0 的所有对象并查看它们(这就是哈希表的作用,它通过仅获取具有相同哈希码的那些来减少迭代量,然后查看它们)。

如果要设置(更改)HashCode,它不会是哈希码,因为为具有给定“状态”的对象指定的值无法更改。

至于“最佳方法”,返回相同哈希代码的唯一对象越少,哈希表的性能就越好。如果你有一长串“int”,你可以使用该 int 值作为你的哈希码,你将得到罕见的完美哈希——其中每个对象都映射到一个哈希码。

请注意,哈希表并不真正适合这种存储整数的情况。对于您尝试存储使用其他机制不太容易唯一识别或比较的复杂对象的情况来说,这是更好的选择。

你的“Int列表”的问题是,如果你有数字5并且你想在你的表中查找它,你只会在那里找到数字5。

现在,如果您想查看表中是否存在数字 5,那就是另一回事了。

对于一组几乎没有漏洞的数字,您可以创建一个简单的布尔数组。如果 a[5] 存在(为真),则 a 在列表中。如果你的数字集非常稀疏(1, 5, 10002930304),那么这不是一个很好的解决方案,因为你将“False”存储在位置 2, 3, 4 中,然后在最后一个位置之前存储一大堆数字,但它是直接查找,无论您添加多少个数字,这一步都不会再花费任何时间 - O(1)。

您可以通过对字节数组进行二进制查找来使这种类型的存储更加密集,但除非您非常擅长位操作,否则请跳过它。它将涉及看起来像这样的东西:

public boolean doesNumberExist(int number) {
    return bytes[number / 8] & ( 1 << number % 8);
}

如果你的最高数字真的很大,这仍然会耗尽内存。

因此,对于大型稀疏列表,我将使用排序的整数数组而不是填充较少的布尔数组。一旦它被排序为数组,你只需进行二分搜索即可;从排序数组的中间开始,如果您想要的数字更大,则将列表的上半部分在中心划分并检查该数字,重复。

排序后的 int 数组需要更多的步骤,但不会太多,并且不会为不存在的数字浪费任何内存。

The hash code is just a number that is guaranteed to be the same for every type of object "the same" as the original object.

This means that returning "0" for every hash code call would be valid, but self-defeating. The point is there can (and in most cases will) be duplicates.

If you know the hash code of an object, you cannot necessarily access it. per my example above, if all objects returned "0", you still couldn't ask which object has hash code 0. However, you could ask for ALL objects with hash code 0 and look through them (this is what a hashtable does, it reduces the amount of iterating by getting just the ones with the same hash code, then looks through those).

If you were to set (Change) a HashCode, it would not be a hash code because the value given for an object with a given "State" cannot change.

As for the "Best Way" to do it, the fewer unique objects that return the same hash code, the better your hash tables will perform. If you have a long list of "int", you can just use that int value as your hash code and you will have that rare perfect hash--where each object maps to exactly one hash code.

Note that hashtable isn't really appropriate for this situation of storing ints. It's better for situations where you are trying to store complex objects that are not so easy to uniquely identify or compare using other mechanisms.

The problem with your "List of Int" is that if you have the number 5 and you want to look it up in your table, you are just going to find a number 5 there.

Now, if you want to see if the number 5 exists in your table or not, that's a different matter.

For a set of numbers with few holes you could make a simple boolean array. If a[5] exists (is true), than a is in the list. If your set of numbers is very sparse (1, 5, 10002930304) then that wouldn't be a very good solution since you'd store "False" in spots 2, 3, 4 and then a whole bunch of them before the last number, but it is a direct lookup, a single step that never takes any longer no matter how many numbers you add--O(1).

You could make this type of storage MUCH denser by making it a binary lookup into a byte array, but unless you're pretty good with bit-manipulation, skip it. It would involve stuff that looks like this:

public boolean doesNumberExist(int number) {
    return bytes[number / 8] & ( 1 << number % 8);
}

and this still runs out of memory if your highest number is really big.

So, for a large sparse list I'd use a sorted integer array instead of a lightly populated boolean array. Once it's sorted as an array you just do a binary search; start in the middle of the sorted array, If the number you want is higher, then divide the top half of the list in the center and check that number, repeat.

The sorted int array takes a few more steps but not too many more and it doesn't waste any memory for non-existent numbers.

可是我不能没有你 2024-09-14 22:49:23

哈希函数返回一个整数。您使用该整数(键)作为索引来存储您的信息。在java中,可以使用java.util.Hashtable。您始终可以自己推出,它可以像使用键作为索引的数组一样简单。

对于您的程序,您确实需要弄清楚如何访问这些元素。哈希表提供对特定项目的超快速访问,但不(不应该)提供顺序访问

如果您使用 java,请检查哈希表并查看这些方法是否足以满足您的应用程序:

http://java.sun.com/j2se/1.4.2/文档/api/java/util/Hashtable.html

A hashing function returns an integer. You use that integer (key) as an index to store your information. In java, you can use java.util.Hashtable. You can always roll your own, it can be as simple as an array that uses the key as the index.

For your program, you really need to figure out how you need to access the elements. A hashtable offers super fast access to a specific item, but doesn't (shouldn't) offer sequential access

If you're using java, check out hashtable and see if the methods are sufficient for your application:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/Hashtable.html

梦中的蝴蝶 2024-09-14 22:49:23

Int 的大列表用作我通过索引访问的查找表。然后我猜索引将是键,列表元素是值。希望能澄清这一点

在这种情况下,java.util.HashTable 并不比 java.util.ArrayList 更好。 HashTable 会消耗至少两倍的内存,同时访问速度稍慢。

比 ArrayList 更好的是普通的 int[],因为不需要创建和存储 Integer 实例。我估计这将减少 3 倍的内存消耗。

然而,在内存中保留 10^9 int 仍然是一个令人畏惧的提议,因为每个 int 消耗 4 个字节的内存。那是 4 GB。您可能希望将列表的至少一部分存储在磁盘而不是内存上,并使用例如 RandomAccessFile 来查找正在查找的索引。

The big list of Int works as a look up table that I access through index. Then I guess the index would be the key and the list elements are values. Hope that clarifies it

In that case, a java.util.HashTable is not better than an java.util.ArrayList. A HashTable would consume at least twice the memory, while offering slightly slower access.

Even better than the ArrayList is a plain int[], as no Integer instances need to be created and stored. I estimate this will reduce memory consumption by a factor of 3.

However, keeping 10^9 int in memory remains a daunting proposition, as each int consumes 4 bytes of memory. That's 4 GB. You might wish to keep at least part of the list stored on disk rather than memory, and use, for instance, RandomAccessFile to seek to the index being looked up.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文