哈希码。如何使用

发布于 2024-09-07 22:49:23 字数 491 浏览 6 评论 0 原文

虽然我非常了解 HashCode 是什么以及哈希表的作用，但我不得不承认我不知道如何使用它（超出了普通字典的范围）。我想实现自己的哈希表，所以首先我想了解有关哈希的基本知识：

我知道我可以使用 getHashCode()/hashCode() 获取哈希码在 Java 和 Scala 中。这个数字是如何确定的。（只是出于好奇）
如果我知道一个对象的HashCode，我该如何访问它？也就是说，我该如何调用那个内存桶呢？
我可以更改/设置变量HashCode吗？

现在，我有一个非常大（大约 10^9）的 Int 列表。我将访问其中的一些（从无到全部），并且我需要以尽可能最快的方式完成它。哈希表是最好的方法吗？

PS：我不想讨论它，我只是想知道哈希表是否知道是最有效的。如果还有其他好的方法也许你可以指点我。

谢谢，

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

乜一 2024-09-14 22:49:23

哈希码只是一个数字，保证每种类型的对象与原始对象“相同”。

这意味着为每个哈希码调用返回“0”将是有效的，但会弄巧成拙。关键是可以（并且在大多数情况下）存在重复项。

如果您知道对象的哈希码，则不一定可以访问它。根据上面的示例，如果所有对象都返回“0”，您仍然无法询问哪个对象的哈希代码为 0。但是，您可以询问哈希代码为 0 的所有对象并查看它们（这就是哈希表的作用，它通过仅获取具有相同哈希码的那些来减少迭代量，然后查看它们）。

如果要设置（更改）HashCode，它不会是哈希码，因为为具有给定“状态”的对象指定的值无法更改。

至于“最佳方法”，返回相同哈希代码的唯一对象越少，哈希表的性能就越好。如果你有一长串“int”，你可以使用该 int 值作为你的哈希码，你将得到罕见的完美哈希——其中每个对象都映射到一个哈希码。

请注意，哈希表并不真正适合这种存储整数的情况。对于您尝试存储使用其他机制不太容易唯一识别或比较的复杂对象的情况来说，这是更好的选择。

你的“Int列表”的问题是，如果你有数字5并且你想在你的表中查找它，你只会在那里找到数字5。

现在，如果您想查看表中是否存在数字 5，那就是另一回事了。

对于一组几乎没有漏洞的数字，您可以创建一个简单的布尔数组。如果 a[5] 存在（为真），则 a 在列表中。如果你的数字集非常稀疏（1, 5, 10002930304），那么这不是一个很好的解决方案，因为你将“False”存储在位置 2, 3, 4 中，然后在最后一个位置之前存储一大堆数字，但它是直接查找，无论您添加多少个数字，这一步都不会再花费任何时间 - O(1)。

您可以通过对字节数组进行二进制查找来使这种类型的存储更加密集，但除非您非常擅长位操作，否则请跳过它。它将涉及看起来像这样的东西：

public boolean doesNumberExist(int number) {
    return bytes[number / 8] & ( 1 << number % 8);
}

如果你的最高数字真的很大，这仍然会耗尽内存。

因此，对于大型稀疏列表，我将使用排序的整数数组而不是填充较少的布尔数组。一旦它被排序为数组，你只需进行二分搜索即可；从排序数组的中间开始，如果您想要的数字更大，则将列表的上半部分在中心划分并检查该数字，重复。

排序后的 int 数组需要更多的步骤，但不会太多，并且不会为不存在的数字浪费任何内存。

The hash code is just a number that is guaranteed to be the same for every type of object "the same" as the original object.

This means that returning "0" for every hash code call would be valid, but self-defeating. The point is there can (and in most cases will) be duplicates.

If you know the hash code of an object, you cannot necessarily access it. per my example above, if all objects returned "0", you still couldn't ask which object has hash code 0. However, you could ask for ALL objects with hash code 0 and look through them (this is what a hashtable does, it reduces the amount of iterating by getting just the ones with the same hash code, then looks through those).

If you were to set (Change) a HashCode, it would not be a hash code because the value given for an object with a given "State" cannot change.

As for the "Best Way" to do it, the fewer unique objects that return the same hash code, the better your hash tables will perform. If you have a long list of "int", you can just use that int value as your hash code and you will have that rare perfect hash--where each object maps to exactly one hash code.

Note that hashtable isn't really appropriate for this situation of storing ints. It's better for situations where you are trying to store complex objects that are not so easy to uniquely identify or compare using other mechanisms.

The problem with your "List of Int" is that if you have the number 5 and you want to look it up in your table, you are just going to find a number 5 there.

Now, if you want to see if the number 5 exists in your table or not, that's a different matter.

For a set of numbers with few holes you could make a simple boolean array. If a[5] exists (is true), than a is in the list. If your set of numbers is very sparse (1, 5, 10002930304) then that wouldn't be a very good solution since you'd store "False" in spots 2, 3, 4 and then a whole bunch of them before the last number, but it is a direct lookup, a single step that never takes any longer no matter how many numbers you add--O(1).

You could make this type of storage MUCH denser by making it a binary lookup into a byte array, but unless you're pretty good with bit-manipulation, skip it. It would involve stuff that looks like this:

public boolean doesNumberExist(int number) {
    return bytes[number / 8] & ( 1 << number % 8);
}

and this still runs out of memory if your highest number is really big.

So, for a large sparse list I'd use a sorted integer array instead of a lightly populated boolean array. Once it's sorted as an array you just do a binary search; start in the middle of the sorted array, If the number you want is higher, then divide the top half of the list in the center and check that number, repeat.

The sorted int array takes a few more steps but not too many more and it doesn't waste any memory for non-existent numbers.

回复收藏 0 原文