批评这个 C# Hashmap 实现？

发布于 2024-09-18 02:13:28 字数 5795 浏览 3 评论 0原文

我用 C# 编写了一个哈希图作为自学练习。我想将链接实现为一种碰撞处理技术。起初我以为我只是使用 GetHashCode 作为我的哈希算法，但我很快发现使用 GetHashCode 返回的数字并不总是可行（如果您想按number 和 numeric 可以为负数:()。因此，我想出了一种缩小数字范围的笨拙方法（请参阅 MyGetHashCode）。

有人对这个实现（哈希函数和一般情况）有任何指示/提示/批评吗？提前致谢！

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace HashMap
{
    class Program
    {

        public class MyKVP<T, K>
        {
            public T Key { get; set; }
            public K Value { get; set; }
            public MyKVP(T key, K value)
            {
                Key = key;
                Value = value;
            }
        }


        public class MyHashMap<T, K> : IEnumerable<MyKVP<T,K>>
            where T:IComparable
        {

            private const int map_size = 5000;
            private List<MyKVP<T,K>>[] storage;
            public MyHashMap()
            {
                storage = new List<MyKVP<T,K>>[map_size];
            }

            System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
            public IEnumerator<MyKVP<T, K>> GetEnumerator()
            {
                foreach (List<MyKVP<T, K>> kvpList in storage)
                {
                    if (kvpList != null)
                    {
                        foreach (MyKVP<T, K> kvp in kvpList)
                        {
                            yield return kvp;
                        }
                    }
                }
            }


            private int MyGetHashCode(T key)
            {
                int i = key.GetHashCode();
                if (i<0) i=i*-1;
                return i / 10000;
            }

            public void Add(T key, K data)
            {
                int value = MyGetHashCode(key);

                SizeIfNeeded(value);

                //is this spot in the hashmap null?
                if (storage[value] == null)
                {
                    //create a new chain
                    storage[value] = new List<MyKVP<T, K>>();
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }
                else
                { 
                    //is this spot taken?
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp != null) //key exists, throw
                    {
                        throw new Exception("This key exists. no soup for you.");
                    }

                    //if we didn't throw, then add us
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }

            }

            private MyKVP<T, K> Find(int value, T key)
            {
                foreach (MyKVP<T, K> kvp in storage[value])
                {
                    if (kvp.Key.CompareTo(key) == 0)
                    {
                        return kvp;
                    }
                }

                return null;
            }

            private void SizeIfNeeded(int value)
            {
                if (value >= storage.Length)
                {
                    List<MyKVP<T, K>>[] temp = storage;
                    storage = new List<MyKVP<T, K>>[value+1];
                    Array.Copy(temp, storage, temp.Length);
                }
            }

            public K this[T key]
            {

                get 
                {
                    int value = MyGetHashCode(key);
                    if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp == null) throw new Exception("key does not exist");
                    return myKvp.Value;
                }
                set 
                {
                    Add(key, value);
                }
            }


            public void Remove(T key)
            {
                int value = MyGetHashCode(key);
                if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                if (storage[value] == null) { throw new IndexOutOfRangeException("Key does not exist."); }

                //loop through each kvp at this hash location
                MyKVP<T, K> myKvp = Find(value, key);
                if (myKvp != null)
                {
                    storage[value].Remove(myKvp);
                }
            }
        }

        static void Main(string[] args)
        {
            MyHashMap<string, int> myHashMap = new MyHashMap<string, int>();
            myHashMap.Add("joe", 1);
            myHashMap.Add("mike", 2);
            myHashMap.Add("adam", 3);
            myHashMap.Add("dad", 4);

            Assert.AreEqual(1, myHashMap["joe"]);
            Assert.AreEqual(4, myHashMap["dad"]);
            Assert.AreEqual(2, myHashMap["mike"]);
            Assert.AreEqual(3, myHashMap["adam"]);

            myHashMap.Remove("joe");

            try 
            {
                if (myHashMap["joe"] == 3) { }; //should throw 
            }
            catch (Exception) 
            {
                try { myHashMap.Add("mike",1); }
                catch (Exception) {

                    foreach (MyKVP<string, int> kvp in myHashMap)
                    { 
                        Console.WriteLine(kvp.Key + " " + kvp.Value.ToString());
                    }


                    return;
                }

            }

            throw new Exception("fail");
        }
    }
}

原文

I wrote a hashmap in C# as a self study exercise. I wanted to implement chaining as a collision handling technique. At first I thought I'd simply use GetHashCode as my hashing algorithm, but I quickly found that use the numbers returned by GetHashCode would not always be viable (size of the int causes a out of mem if you want to index and array by the number and numbers can be negative :(). So, I came up with a kludgey method of narrowing the numbers (see MyGetHashCode).

Does anyone have any pointers/tips/criticism for this implementation (of the hash function and in general)? Thanks in advance!

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace HashMap
{
    class Program
    {

        public class MyKVP<T, K>
        {
            public T Key { get; set; }
            public K Value { get; set; }
            public MyKVP(T key, K value)
            {
                Key = key;
                Value = value;
            }
        }


        public class MyHashMap<T, K> : IEnumerable<MyKVP<T,K>>
            where T:IComparable
        {

            private const int map_size = 5000;
            private List<MyKVP<T,K>>[] storage;
            public MyHashMap()
            {
                storage = new List<MyKVP<T,K>>[map_size];
            }

            System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
            public IEnumerator<MyKVP<T, K>> GetEnumerator()
            {
                foreach (List<MyKVP<T, K>> kvpList in storage)
                {
                    if (kvpList != null)
                    {
                        foreach (MyKVP<T, K> kvp in kvpList)
                        {
                            yield return kvp;
                        }
                    }
                }
            }


            private int MyGetHashCode(T key)
            {
                int i = key.GetHashCode();
                if (i<0) i=i*-1;
                return i / 10000;
            }

            public void Add(T key, K data)
            {
                int value = MyGetHashCode(key);

                SizeIfNeeded(value);

                //is this spot in the hashmap null?
                if (storage[value] == null)
                {
                    //create a new chain
                    storage[value] = new List<MyKVP<T, K>>();
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }
                else
                { 
                    //is this spot taken?
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp != null) //key exists, throw
                    {
                        throw new Exception("This key exists. no soup for you.");
                    }

                    //if we didn't throw, then add us
                    storage[value].Add(new MyKVP<T, K>(key, data));
                }

            }

            private MyKVP<T, K> Find(int value, T key)
            {
                foreach (MyKVP<T, K> kvp in storage[value])
                {
                    if (kvp.Key.CompareTo(key) == 0)
                    {
                        return kvp;
                    }
                }

                return null;
            }

            private void SizeIfNeeded(int value)
            {
                if (value >= storage.Length)
                {
                    List<MyKVP<T, K>>[] temp = storage;
                    storage = new List<MyKVP<T, K>>[value+1];
                    Array.Copy(temp, storage, temp.Length);
                }
            }

            public K this[T key]
            {

                get 
                {
                    int value = MyGetHashCode(key);
                    if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                    MyKVP<T, K> myKvp = Find(value, key);
                    if (myKvp == null) throw new Exception("key does not exist");
                    return myKvp.Value;
                }
                set 
                {
                    Add(key, value);
                }
            }


            public void Remove(T key)
            {
                int value = MyGetHashCode(key);
                if (value > storage.Length) { throw new IndexOutOfRangeException("Key does not exist."); }
                if (storage[value] == null) { throw new IndexOutOfRangeException("Key does not exist."); }

                //loop through each kvp at this hash location
                MyKVP<T, K> myKvp = Find(value, key);
                if (myKvp != null)
                {
                    storage[value].Remove(myKvp);
                }
            }
        }

        static void Main(string[] args)
        {
            MyHashMap<string, int> myHashMap = new MyHashMap<string, int>();
            myHashMap.Add("joe", 1);
            myHashMap.Add("mike", 2);
            myHashMap.Add("adam", 3);
            myHashMap.Add("dad", 4);

            Assert.AreEqual(1, myHashMap["joe"]);
            Assert.AreEqual(4, myHashMap["dad"]);
            Assert.AreEqual(2, myHashMap["mike"]);
            Assert.AreEqual(3, myHashMap["adam"]);

            myHashMap.Remove("joe");

            try 
            {
                if (myHashMap["joe"] == 3) { }; //should throw 
            }
            catch (Exception) 
            {
                try { myHashMap.Add("mike",1); }
                catch (Exception) {

                    foreach (MyKVP<string, int> kvp in myHashMap)
                    { 
                        Console.WriteLine(kvp.Key + " " + kvp.Value.ToString());
                    }


                    return;
                }

            }

            throw new Exception("fail");
        }
    }
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

故人如初 2024-09-25 02:13:28

您的哈希方法是固定范围的。这意味着单个项目可能会导致创建 214748 个存储桶（如果它的哈希码重新哈希为 214747）。更常用（而且几乎总是更好的方法）是从一个已知的初始大小（由于领域知识）开始，该初始大小对于所有值来说都足够大，或者从小开始并让哈希图根据需要调整自身大小。通过重新探测，需要调整大小的明显衡量标准是需要多少重新探测。当您在此处尝试链接时，您将希望保持平均和最大链大小较小。这可以降低最坏情况下的查找时间，从而使平均查找时间更接近最好情况下的 O(1)。

这种散列（以及初始表大小）的两种最常见的方法是使用素数或 2 的幂。前者被认为（尽管在这一点上存在一些争议）提供更好的密钥分配，而后者允许更快的计算（两种情况都对输入哈希进行取模，但已知数字是 2 的幂），模数可以通过二进制与运算快速完成）。在链接时使用 2 的幂的另一个优点是，可以测试链以查看调整哈希大小是否实际上会导致该链被拆分（如果您有一个 8 值表并且有一个链）如果它们的哈希值都是 17、1 或 33，那么将表大小加倍仍会将它们留在同一个链中，但将其加倍则会重新分配它们）。

您没有提供替换语义的方法，这在 .NET 字典类型中很常见（如果已经存在具有该键的项目，则添加将出错，但分配给索引则不会）。

您尝试超出存储桶数量的检索错误对用户来说毫无意义，他们不关心存储桶是否存在，只关心密钥（他们根本不需要知道您的实现是如何工作的）。未找到密钥的两种情况都应该抛出相同的错误（System.Collections.Generic.KeyNotFoundException 具有精确正确的语义，因此您可以重用它。）。

在这种情况下，使用 List 相当繁重。一般来说，我会对任何说 BCL 集合太重的人皱眉，但当涉及到滚动您自己的集合时，通常是因为 (1) 您想从练习中学习或 (2) BCL 集合不适合你的目的。在情况（1）中，您应该学习如何完成您开始的工作，在情况（2）中，您需要确保 List 没有您在 Dictionary< 中发现的任何失败。 /代码>。

对于不了解实现细节的人来说，您的删除既会引发无意义的错误，也会引发不一致的错误（该存储桶中是否存在其他内容不是他们应该关心的事情）。由于删除不存在的项目并无害处，因此更常见的是仅返回一个布尔值来指示该项目是否存在，并让用户决定这是否指示错误。在删除项目后继续搜索整个存储桶也是浪费的。

您的实现现在允许空键，这是足够合理的（事实上，IDictionary 的文档表明实现可能会或可能不会这样做）。但是，拒绝它们的方法是返回由于尝试对 null 调用 GetHashCode() 而导致的 NullReferenceException，而不是检查并抛出 ArgumentNullException< /代码>。如果用户收到 NullReferenceException 则表明集合本身为 null。因此，这是一个明显的错误。

Your hash method is of a fixed range. This means that a single item could cause 214748 buckets to be created (if it's hashcode rehashed to 214747). A more commonly used (and almost always better approach) is to start with an initial size that is either known (due to knowledge of the domain) to be big enough for all values or to start small and have hashmap resize itself as appropriate. With re-probing the obvious measure of a need to resize is how much reprobing was needed. With chaining as you are experimenting with here, you'll want to keep both average and maximum chain sizes down. This keeps down your worse-case lookup time, and hence your average lookup time closer to the best-case O(1).

The two most common approaches to such hashing (and hence to initial table size) is to either use prime numbers or powers of two. The former is considered (though there is some contention on the point) to offer better distribution of keys while the latter allows for faster computation (both cases do a modulo on the input-hash, but with a number known to be a power of 2, the modulo can be quickly done as a binary-and operation). Another advantage of using a power of two when you are chaining, is that its possible to test a chain to see if resizing the hash would actually cause that chain to be split or not (if you have an 8-value table and there's a chain whose hashes are all either 17, 1 or 33 then doubling the table size would still leave them in the same chain, but quadrupling it would re-distribute them).

You don't have a method offering replace semantics, which is usual with .NET dictionary types (where adding will error if there's already an item with that key, but assigning to an index won't).

Your error on a retrieval that would try to go beyond the number of buckets will make no sense to the user, who doesn't care whether the bucket existed or not, only the key (they need not know how your implementation works at all). Both cases where a key isn't found should throw the same error (System.Collections.Generic.KeyNotFoundException has precisely the right semantics, so you could reuse that.).

Using a List is rather heavy in this case. Generally I'd frown on anyone saying a BCL collection was too heavy, but when it comes to rolling your own collections, its generally either because (1) you want to learn from the exercise or (2) the BCL collections don't suit your purposes. In case (1) you should learn how to complete the job you started, and in case (2) you need to be sure that List doesn't have whatever failing you found with Dictionary.

Your removal both throws a nonsensical error for someone who doesn't know about the implementation details, and an inconsistent error (whether something else existed in that bucket is not something they should care about). Since removing a non-existent item isn't harmful it is more common to merely return a bool indicating whether the item had been present or not, and let the user decide if that indicates an error or not. It is also wasteful in continuing to search the entire bucket after the item has been removed.

Your implementation does now allow null keys, which is reasonable enough (indeed, the documentation for IDictionary<TKey, TValue> says that implementations may or may not do so). However, the way you reject them is by having the NullReferenceException caused by trying to call GetHashCode() on null be returned, rather than checking and throwing a ArgumentNullException. For the user to receive a NullReferenceException suggests that the collection itself was null. This is hence a clear bug.

回复收藏 0 原文

寂寞清仓 2024-09-25 02:13:28

Remove 方法永远不应该抛出异常。您正在尝试删除一个项目。如果已将其删除，则不会造成任何损害。 .Net 中的所有集合类都使用 bool 作为返回值来指示某个项目是否确实被删除。
不要抛出异常，抛出特定的异常。浏览 Collection 命名空间中的所有异常以找到合适的异常。
添加 TryGetValue
使用已经是 .Net 一部分的 KeyValuePair，而不是创建您自己的。
添加一个可以定义地图大小的构造函数。
抛出异常时，请包含抛出异常的详细信息。例如，不要写“此键存在”，而是写string.Format(“Key '{0}'已经存在”, key)

回复收藏 0 原文

梦纸 2024-09-25 02:13:28

很遗憾地说，这个类不能作为 HashMap 甚至简单的字典工作。

首先，GetHashCode() 返回的值不是唯一的。两个不同的对象（例如两个字符串）可能返回相同的哈希码值。使用哈希码作为数组索引的想法在哈希码冲突的情况下只会导致记录丢失。我建议阅读有关 GetHashCode() 方法以及如何从 MSDN 实现它的信息。一些明显的例子是，如果您获得从 0 开始的所有可能的 Int64 值的哈希码，则哈希码肯定会在某个时刻发生冲突。

另一件事是，for 循环查找速度很慢。您应该考虑使用二分搜索进行查找。为此，您必须随时维护按键排序的键值对，这意味着您应该使用 List 而不是数组来存储变量，因此在添加新的键值对时，您可以可以将其插入到适当的索引处。

毕竟，请确保当您为真正的哈希图编码时，您意识到不同键的哈希码可以相同，并且永远不要使用 for 循环从 0 到 len-1 进行查找。

回复收藏 0 原文

~没有更多了~