
Why can hashCode() return the same value for different objects in Java?

Posted 2024-10-05 23:35:22, 323 words, 0 views, 0 comments

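Quoting from the book I'm reading, Head First Java:

The point is that hash codes can be the same without necessarily guaranteeing that the objects are equal, because the "hashing algorithm" used in the hashCode() method might happen to return the same value for multiple objects.

Why might the hashCode() method return the same value for different objects? Doesn't that cause problems?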


微暖i 2024-10-12 23:35:22

Hashing an object means "finding a good, descriptive value (number) that can be reproduced by the very same instance again and again". Because hash codes from Java's Object.hashCode() are of type int, there are only 2^32 different values. That's why, depending on the hashing algorithm, you will get so-called "collisions", where two distinct objects produce the same hashCode.

Typically, this does not cause any problems, because hashCode() is mostly used together with equals(). For instance, a HashMap calls hashCode() on a key to find out whether that key may already be contained in the map. If the HashMap does not find the hash code, the key is clearly not in the map yet. If it does, it has to double-check all keys that have that same hash code using equals().

I.e.

A.hashCode() == B.hashCode() // does not necessarily mean
A.equals(B)

But

A.equals(B) // means
A.hashCode() == B.hashCode()

If equals() and hashCode() are implemented correctly.

For a more precise description of the general hashCode contract, see the Javadoc.
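As a small illustration of the relationship above, the sketch below uses the strings "Aa" and "BB", which are different but hash to the same value under String.hashCode(). The class name CollisionDemo is made up for this example. A HashMap still keeps both keys apart, because it falls back to equals() when hash codes match.

```java
import java.util.HashMap;
import java.util.Map;

class CollisionDemo {
    public static void main(String[] args) {
        // Two different strings with the same hash code.
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112
        System.out.println("Aa".equals("BB")); // false

        // The map distinguishes the keys with equals(), so nothing is lost.
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);
        System.out.println(map.size());    // 2
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
    }
}
```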


红ご颜醉 2024-10-12 23:35:22

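There are only a little over 4 billion possible hash codes (the range of int), but the number of objects you could choose to create is much larger. Therefore, by the pigeonhole principle, some objects must share the same hash code.

For example, the number of possible strings containing 10 letters from A-Z is 26**10, which is 141167095653376. It is impossible to assign all of these strings unique hash codes. Nor does it matter: hash codes don't need to be unique. They just need to not have too many collisions for real data.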
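The arithmetic behind this pigeonhole argument can be checked with a few lines of Java. This is only a sketch; the class and method names (PigeonholeCount, countStrings) are invented for the example.

```java
class PigeonholeCount {
    // Number of strings of exactly `length` symbols over an alphabet of `alphabetSize` symbols.
    static long countStrings(int alphabetSize, int length) {
        long count = 1;
        for (int i = 0; i < length; i++) {
            count = Math.multiplyExact(count, alphabetSize); // throws if the long overflows
        }
        return count;
    }

    public static void main(String[] args) {
        long strings = countStrings(26, 10);
        long hashCodes = 1L << 32; // number of distinct int values

        System.out.println(strings);             // 141167095653376
        System.out.println(hashCodes);           // 4294967296
        System.out.println(strings / hashCodes); // 32868
    }
}
```

So even for this small family of strings, each int value would have to serve tens of thousands of strings on average.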


听你说爱我 2024-10-12 23:35:22

The idea of a hash table is that you want to implement a data structure called a dictionary in an efficient way. A dictionary is a key/value store, i.e., you want to be able to store certain objects under a certain key and later retrieve them again using the same key.

One of the most efficient ways to access values is to store them in an array. For instance, we could implement a dictionary that uses integers for keys and Strings for values like so:

String[] dictionary = new String[DICT_SIZE];
dictionary[15] = "Hello";
dictionary[121] = "world";

System.out.println(dictionary[15]); // prints "Hello"

Unfortunately, this approach is not very general at all: the index of an array has to be an integer value, but ideally we'd like to be able to use arbitrary kinds of objects for our keys, not only integers.

The way to solve this problem is to map arbitrary objects to integer values, which we can then use as indices into our array. In Java, that's what hashCode() does. So now, we could try to implement a String->String dictionary:

String[] dictionary = new String[DICT_SIZE];
// "a" -> "Hello"
dictionary["a".hashCode()] = "Hello";

// "b" -> "world"
dictionary["b".hashCode()] = "world";

System.out.println(dictionary["b".hashCode()]); // prints world

But hey, what if there is some object which we'd like to use as a key, but its hashCode method returns a value that's greater than or equal to DICT_SIZE? Then we'd get an ArrayIndexOutOfBoundsException and that would be undesirable. So, let's just make it as big as we can, right?

public static final int DICT_SIZE = Integer.MAX_VALUE; // Ooops!

But that would mean allocating an enormous amount of memory for our array, even if we only intend to store a few items. So that can't be the best solution, and in fact we can do better. Let's assume we had a function h that, for any given DICT_SIZE, maps arbitrary integers into the range [0, DICT_SIZE). Then we could just apply h to whatever the hashCode() method of a key object returns and be certain that we stay within the bounds of the underlying array.

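public static int h(int value, int DICT_SIZE) {
    // returns an integer >= 0 and < DICT_SIZE for every value.
    // Math.floorMod is one possible choice; unlike %, it never returns a negative result here.
    return Math.floorMod(value, DICT_SIZE);
}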
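One way to fill in such an h, shown here purely as an assumption and not as what any particular library does, is Math.floorMod. The sketch below (class name IndexForHash is made up) also shows why the plain % operator is not enough: hashCode() may be negative, and % then yields a negative result.

```java
class IndexForHash {
    // One possible h: maps any int into the range [0, size).
    static int h(int value, int size) {
        return Math.floorMod(value, size);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 16);                  // -7, not a valid array index
        System.out.println(h(-7, 16));                // 9
        System.out.println(h(23, 16));                // 7
        System.out.println(h(Integer.MIN_VALUE, 16)); // 0
    }
}
```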

That function is called a hash function. Now we can adapt our dictionary implementation to avoid the ArrayIndexOutOfBoundsException:

// "a" -> "Hello"
dictionary[h("a".hashCode(), DICT_SIZE)] = "Hello";

// "b" -> "world"
dictionary[h("b".hashCode(), DICT_SIZE)] = "world";

But that introduces another problem: what if h maps the hash codes of two different keys to the same array index? For instance:

int keyA = h("a".hashCode(), DICT_SIZE);
int keyB = h("b".hashCode(), DICT_SIZE);

may yield the same values for keyA and keyB, and in that case we would accidentally overwrite a value in our array:

// "a" -> "Hello"
dictionary[keyA] = "Hello";

// "b" -> "world"
dictionary[keyB] = "world"; // DAMN! This overwrites "Hello"!!

System.out.println(dictionary[keyA]); // prints "world"

Well, you may say, then we just have to make sure that we implement h in such a way that this can never happen. Unfortunately, this isn't possible in general. Consider the following code:

for (int i = 0; i <= DICT_SIZE; i++) {
    dictionary[h(i, DICT_SIZE)] = "dummy";
}

This loop stores DICT_SIZE + 1 values (always the same value, actually, namely the String "dummy") in the dictionary. But the array can only hold DICT_SIZE different entries! That means that when we use h, we overwrite at least one entry. In other words, h maps two different keys to the same index. These "collisions" can't be avoided: if n pigeons try to go into n-1 pigeonholes, at least two of them have to go into the same hole.
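The loop above can be turned into a small experiment that reports the first key whose slot is already taken. The sketch assumes h is implemented with Math.floorMod, which is only one possible choice; the pigeonhole argument guarantees a collision for any h. The names ForcedCollision and firstCollidingKey are invented for this example.

```java
class ForcedCollision {
    // Assumed implementation of h; any function into [0, size) would also collide.
    static int h(int value, int size) {
        return Math.floorMod(value, size);
    }

    // Tries the keys 0..size (that is, size + 1 keys) and returns the first one
    // that lands on an index already used by an earlier key.
    static int firstCollidingKey(int size) {
        boolean[] used = new boolean[size];
        for (int key = 0; key <= size; key++) {
            int index = h(key, size);
            if (used[index]) {
                return key;
            }
            used[index] = true;
        }
        throw new IllegalStateException("size + 1 keys cannot fit into size slots");
    }

    public static void main(String[] args) {
        System.out.println(firstCollidingKey(8)); // 8 (collides with key 0 at index 0)
    }
}
```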

But what we can do is to extend our implementation so that the array can store multiple values under the same index. This can easily be done by using lists. So instead of using:

String[] dictionary = new String[DICT_SIZE];

we write:

List<String>[] dictionary = new List<String>[DICT_SIZE];

(Side remark: note that Java doesn't allow the creation of arrays of generic types, so the above line wouldn't compile -- but you get the idea).
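For readers who want something that does compile, one workaround is to use a list of lists instead of an array of lists. This is a sketch of that alternative, not part of the original design; the names Buckets and createBuckets are made up.

```java
import java.util.ArrayList;
import java.util.List;

class Buckets {
    // Creates `size` empty buckets, so add() can be called on any index right away.
    static List<List<String>> createBuckets(int size) {
        List<List<String>> buckets = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            buckets.add(new ArrayList<>());
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<String>> dictionary = createBuckets(4);
        dictionary.get(1).add("Hello");
        dictionary.get(1).add("world"); // two values under the same index
        System.out.println(dictionary.get(1)); // [Hello, world]
        System.out.println(dictionary.get(0)); // []
    }
}
```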

That will change the access to the dictionary as follows:

// "a" -> "Hello"
dictionary[h("a".hashCode(), DICT_SIZE)].add("Hello");

// "b" -> "world"
dictionary[h("b".hashCode(), DICT_SIZE)].add("world");

If our hash function h returns different values for all our keys, this results in lists with only one element each, and retrieving elements is really simple:

System.out.println(dictionary[h("a".hashCode(), DICT_SIZE)].get(0)); // "Hello"

But we already know that in general h will map different keys to the same integer sometimes. In these cases, the lists will contain more than one value. For retrieval, we have to go through the whole list to find the "correct" value, but how would we recognize it?

Well, instead of storing the value alone, we could always store the complete (key,value) pair in the lists. Then lookup would be performed in two steps:

1. Apply the hash function to retrieve the correct list from the array.
2. Iterate through all pairs stored in the retrieved list: if the pair with the desired key is found, return the value from the pair.

Now adding and retrieving have become complex enough that it's reasonable to give these operations their own methods:

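// Pair is assumed to be a simple key/value holder with getKey(), getValue() and setValue().
@SuppressWarnings("unchecked") // "new List<...>[n]" is not allowed, so a raw array is created
List<Pair<String,String>>[] dictionary = new List[DICT_SIZE];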

public void put(String key, String value) {
    int hashCode = key.hashCode();
    int arrayIndex = h(hashCode, DICT_SIZE);

    List<Pair<String,String>> listAtIndex = dictionary[arrayIndex];
    if (listAtIndex == null) {
        listAtIndex = new LinkedList<Pair<String,String>>();
        dictionary[arrayIndex] = listAtIndex;
    }

    for (Pair<String,String> previouslyAdded : listAtIndex) {
        if (previouslyAdded.getKey().equals(key)) {
            // the key is already used in the dictionary,
            // so let's simply overwrite the associated value
            previouslyAdded.setValue(value);
            return;
        }
    }

    listAtIndex.add(new Pair<String,String>(key, value));
}

public String get(String key) {
    int hashCode = key.hashCode();
    int arrayIndex = h(hashCode, DICT_SIZE);

    List<Pair<String,String>> listAtIndex = dictionary[arrayIndex];
    if (listAtIndex != null) {
        for (Pair<String,String> previouslyAdded : listAtIndex) {
            if (previouslyAdded.getKey().equals(key)) {
                return previouslyAdded.getValue(); // entry found!
            }
        }
    }

    // entry not found
    return null;
}

So, in order for this approach to work, we actually need two operations on the key: the hashCode method to find the list in the array (this is fast if hashCode() and h are both fast) and an equals method, which we need when going through the list.

This is the general idea of hashing, and you will recognize the put and get methods from java.util.Map. Of course, the above implementation is an oversimplification, but it should illustrate the gist of it all.
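Putting the pieces together, here is a self-contained sketch of the dictionary described above. It is a teaching toy, not how java.util.HashMap is actually implemented. The class name SimpleDictionary, the nested Pair class, and the use of Math.floorMod for h are all assumptions made for this example.

```java
import java.util.LinkedList;
import java.util.List;

class SimpleDictionary {
    // Hypothetical key/value holder standing in for the Pair type used above.
    private static final class Pair {
        final String key;
        String value;

        Pair(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Pair>[] buckets;

    @SuppressWarnings("unchecked") // generic array creation is not allowed, so use a raw array
    SimpleDictionary(int size) {
        buckets = new List[size];
    }

    // Assumed hash function h: maps any hash code into [0, buckets.length).
    private int h(int hashCode) {
        return Math.floorMod(hashCode, buckets.length);
    }

    void put(String key, String value) {
        int index = h(key.hashCode());
        if (buckets[index] == null) {
            buckets[index] = new LinkedList<>();
        }
        for (Pair pair : buckets[index]) {
            if (pair.key.equals(key)) {
                pair.value = value; // key already present: overwrite
                return;
            }
        }
        buckets[index].add(new Pair(key, value));
    }

    String get(String key) {
        List<Pair> bucket = buckets[h(key.hashCode())];
        if (bucket != null) {
            for (Pair pair : bucket) {
                if (pair.key.equals(key)) {
                    return pair.value;
                }
            }
        }
        return null; // entry not found
    }

    public static void main(String[] args) {
        // A single bucket forces every key to collide; lookups still work.
        SimpleDictionary dictionary = new SimpleDictionary(1);
        dictionary.put("a", "Hello");
        dictionary.put("b", "world");
        dictionary.put("a", "Hi");
        System.out.println(dictionary.get("a")); // Hi
        System.out.println(dictionary.get("b")); // world
        System.out.println(dictionary.get("c")); // null
    }
}
```

Using a size of 1 in main is deliberate: it makes every key collide, which shows that correctness depends on equals() and only speed depends on how well the hash codes spread.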

Naturally, this approach is not limited to Strings, it works for all kinds of objects, since the methods hashCode() and equals are members of the top-level class java.lang.Object and all other classes inherit from that one.

As you can see, it doesn't really matter if two distinct objects return the same value from their hashCode() method: the above approach will always work! Still, it is desirable that they return different values, to lower the chance of hash collisions produced by h. We have seen that collisions can't be avoided entirely in general, but the fewer collisions we get, the more efficient our hash table becomes. In the worst case, all keys map to the same array index: then all pairs are stored in a single list, and finding a value becomes an operation whose cost is linear in the number of entries in the hash table.
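The worst case can be reproduced with java.util.HashMap itself by using a key class whose hashCode() always returns the same constant. The class BadKey below is made up for this sketch. The map stays correct because equals() still tells the keys apart; only lookup speed suffers.

```java
import java.util.HashMap;
import java.util.Map;

class BadKey {
    final String name;

    BadKey(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof BadKey && ((BadKey) other).name.equals(name);
    }

    @Override
    public int hashCode() {
        return 42; // satisfies the hashCode contract, but every key collides
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            map.put(new BadKey("key" + i), i);
        }
        System.out.println(map.size());                      // 1000
        System.out.println(map.get(new BadKey("key500")));   // 500
        System.out.println(map.get(new BadKey("missing")));  // null
    }
}
```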
