Is it possible to create a minimal perfect hash function in this situation?

Posted 2024-12-10 00:03:10

I want to create a Hash Map (or another structure, if you have any suggestions) to store key value pairs. The keys will all be inserted at once at the same time as the map is created, but I don't know what the keys will be (arbitrary length strings) until runtime, when I need to create the map.

I am parsing a query string like this "x=100&name=bob&color=red&y=150" (but the string can have an unlimited number of variables and the variables can have any length name).

I want to parse it once and create a Hash Map, preferably minimal and with a perfect hash function to satisfy linear storage requirements. Once the map is created the values won't be modified or deleted, no more key value pairs will be added to the map either, so the entire map is effectively a constant. I'm assuming that a variable doesn't occur twice in the string (IE. "x=1&x=2" is not valid).

I am coding in C, and currently have a function that I can use like get("x") which will return the string "100", but it parses the query string each time which takes O(n) time. I'd like to parse it once when it is first loaded since it is a very large query string and every value will be read several times. Even though I'm using C, I don't need code in C as an answer. Pseudocode, or any suggestions at all would be awesome!
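
For context, the per-call linear scan being described might look something like the sketch below. This is an assumed reconstruction (the asker's actual get() is not shown); it just illustrates why each lookup costs O(n) in the length of the query string.

#include <stdlib.h>
#include <string.h>

// Assumed reconstruction of the current per-call linear scan (the real get()
// is not shown in the question). Returns a freshly malloc'd copy of the value
// for `key`, or NULL if the key is absent. Every call walks the whole string.
char* get(const char* query, const char* key) {
   size_t klen = strlen(key);
   const char* p = query;
   while (*p) {
      const char* end = strchr(p, '&');                  // end of this "key=value" pair
      size_t plen = end ? (size_t)(end - p) : strlen(p);
      if (plen > klen && p[klen] == '=' && strncmp(p, key, klen) == 0) {
         size_t vlen = plen - klen - 1;
         char* val = malloc(vlen + 1);
         if (val) { memcpy(val, p + klen + 1, vlen); val[vlen] = '\0'; }
         return val;
      }
      if (!end) break;
      p = end + 1;
   }
   return NULL;                                          // key not found
}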


把时间冻结 2024-12-17 00:03:10

Try GPL'd gperf, or Bob Jenkins' public domain implementation in C

Procedure:

  • receive query string and identify domain of perfect hash function by enumerating the list of keys

  • provide these keys and list size (the range will be 1..size) to the perfect hash generation function derived from above reference implementations

  • Use the perfect hash function generated to create the HashMap

  • Use the same perfect hash function to process the get requests in the HashMap (a minimal sketch of this flow follows below)
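
As a rough illustration of that procedure (not gperf's or Jenkins' actual algorithm), the runtime construction can be reduced to searching for a seed that makes an ordinary string hash collision-free over the known keys. The names phf_find_seed and seeded_hash, and the brute-force seed search, are assumptions made for this sketch; brute force only terminates quickly for small key sets, so a real generator (gperf, or a CHD/BDZ-style construction) is needed once the key count grows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

// Illustrative FNV-1a-style hash with a variable seed.
static uint32_t seeded_hash(const char* s, uint32_t seed) {
   uint32_t h = seed;
   while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
   return h;
}

// Try seeds until every key lands in a distinct slot of a table of size n.
// This brute force only works for a handful of keys; use a real
// minimal-perfect-hash generator for larger sets.
static int phf_find_seed(char** keys, size_t n, uint32_t* seed_out) {
   unsigned char* used = malloc(n);
   if (!used) return 0;
   for (uint32_t seed = 1; seed < 1000000u; seed++) {
      memset(used, 0, n);
      size_t i;
      for (i = 0; i < n; i++) {
         uint32_t slot = seeded_hash(keys[i], seed) % (uint32_t)n;
         if (used[slot]) break;
         used[slot] = 1;
      }
      if (i == n) { *seed_out = seed; free(used); return 1; }
   }
   free(used);
   return 0;   // no suitable seed found within the search budget
}

int main(void) {
   char* keys[]   = { "x", "name", "color", "y" };
   char* values[] = { "100", "bob", "red", "150" };
   size_t n = sizeof keys / sizeof keys[0];
   uint32_t seed;
   if (!phf_find_seed(keys, n, &seed)) return 1;

   // Build the constant table once...
   char** table = calloc(n, sizeof *table);
   if (!table) return 1;
   for (size_t i = 0; i < n; i++)
      table[seeded_hash(keys[i], seed) % (uint32_t)n] = values[i];

   // ...then each get() is one hash plus one array access. Note that the keys
   // themselves still need to be stored if lookups of unknown keys must be
   // rejected, since an unknown key will still hash to *some* slot.
   printf("x = %s\n", table[seeded_hash("x", seed) % (uint32_t)n]);
   free(table);
   return 0;
}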

Edit: Necrolis noted in the comments below that the reference implementations output the perfect hash function as C source code, so you'll need to modify them to generate something like bytecode for a VM instead. You could also use an interpreted language like embedded Scheme or Lua.

It would be interesting to know whether this is worth the effort over a simple (non-perfect) HashMap once the overhead of creating the perfect hash function is amortized over the lookups.

Another option is Cuckoo hashing, which also has O(1) lookups.

ㄖ落Θ余辉 2024-12-17 00:03:10

There are some very good hashing routines; however, proving one of them to be near-perfect requires a lot of knowledge of the inputs. It seems that your inputs are unconstrained enough to make such a proof near-impossible.

Generally speaking a perfect (or near-perfect) routine is sensitive to each bit/byte of input. For speed, the combination operation is typically XOR. The way that such routines prevent two identical bytes from cancelling each other out is to shift or rotate the bits. However such shifting should be done by a number that is a relative prime to the maximum number that can be represented; otherwise, patterns in the input could partially be cancelled by previous input. This reduces entropy in the solution, increasing chance of collision.

The typical solution is to

Start with a number that is a prime (all primes are relative primes)
while (more bytes to be considered) {
  take the next byte of input and multiply it by a second prime
  determine the number of bits that would be lost in a left shift and capture them in a buffer
  shift the bits in the hash "buffer" to the left
  restore the captured high-order bit(s) in the low position
  mask the multiplied byte into the hash "buffer"
}
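
Rendered as C, one possible reading of that loop is the following. This is an illustrative assumption, not the answerer's exact routine; the seed 5381, the multiplier 16777619, and the rotation count 7 (coprime to the 32-bit width) are arbitrary prime-related choices:

#include <stdint.h>
#include <stddef.h>

// Rotate left by r bits: the high-order bits "lost" by the shift are
// restored in the low-order positions, as described above.
static uint32_t rotl32(uint32_t x, unsigned r) {
   return (x << r) | (x >> (32 - r));
}

static uint32_t prime_rotate_hash(const unsigned char* data, size_t len) {
   uint32_t hash = 5381u;                        // prime starting number
   for (size_t i = 0; i < len; i++) {
      hash = rotl32(hash, 7);                    // 7 is coprime to the 32-bit width
      hash ^= (uint32_t)data[i] * 16777619u;     // multiply the byte by a second prime, mask it in
   }
   return hash;
}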

The problems with such a routine are known. Basically, when the input lacks variation, the dispersion of the output is not ideal. That said, this technique gives a good dispersion of input bits across the entire domain of outputs, provided there is sufficient input to wander away from the initial prime starting number. Unfortunately, picking a random starting number is not a solution, as it then becomes impossible to recompute the hash reproducibly.

In any case, the prime used in the multiplication should not cause the multiplication to overflow. Likewise, the captured high-order bits must be put back in the low-order positions if you want to avoid losing the dispersion effects of the early input (otherwise the result becomes grouped around the later bits/bytes only). Prime number selection affects the dispersion, and sometimes tuning is required for good effect.

By now you should easily be able to see that a near-perfect hash takes more computational time than a decent less-than-near-perfect hash. Hash algorithms are designed to account for collisions, and most Java hash structures resize at an occupancy threshold (typically around 70%, but it is tunable). Since the resizing is built in, as long as you don't write a terrible hash, the Java data structures will keep re-tuning themselves so that your chance of collision stays low.

Optimizations which can speed up a hash include computing on groups of bits, dropping the occasional byte, pre-computing lookup tables of commonly used multiplied numbers (indexed by input), and so on; a small example follows below. Don't assume that an optimization is faster: depending on the architecture, machine details, and the "age" of the optimization, its assumptions sometimes no longer hold, and applying the optimization actually increases the time to compute the hash.
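
As a concrete sketch of the lookup-table idea above (the table name and the reuse of the rotate/multiply hash from the previous snippet are assumptions for illustration):

#include <stdint.h>
#include <stddef.h>

// Multiply every possible byte value by the per-symbol prime once up front,
// so the inner loop becomes a table read instead of a multiplication.
// Whether this is actually faster is hardware dependent, as noted above.
static uint32_t mul_table[256];

static void init_mul_table(void) {
   for (int b = 0; b < 256; b++)
      mul_table[b] = (uint32_t)b * 16777619u;
}

static uint32_t table_hash(const unsigned char* data, size_t len) {
   uint32_t hash = 5381u;
   for (size_t i = 0; i < len; i++)
      hash = ((hash << 7) | (hash >> 25)) ^ mul_table[data[i]];
   return hash;
}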

泛滥成性 2024-12-17 00:03:10

There's no such thing as a perfect hash in what you're describing. A perfect hash would be the original input. If you're guaranteed that your data will only be certain things (such as Latin-based ASCII or only certain keys) then you can hash well, but perfectly? No. Not possible. You have to create a linked-list or vector hash-miss mechanism as well. Any variant in the system (like the count of inputs in your case) will invalidate the perfect hash concept.

What you want defies the laws of math.

You can achieve near O(1), but there are unanswered questions here. The questions are:

  1. Why does it have to be linear storage?
  2. Are deletions from the table common (you only specified that key value pairs wouldn't be added after initial creation)?
  3. How large is the table likely to grow compared to the range of the hash?
  4. How frequent are insertions compared to repeating data?
  5. Is memory an important factor?

Although a perfect hash isn't possible, the question becomes entirely academic if you simply use a linked-list (chained) hash table with a bucket count that is at least two standard deviations above the mean of your potential unique hashes. It's minimal memory (relatively speaking, of course, and depending on total potential size), deletion friendly, and would have nearly O(1) lookup time as long as question 3 is answered with something like "far smaller".

The following should get you started but I'll leave decisions about which hash algorithm to use up to you...

#include <stdlib.h>
#include <string.h>
#include <stdint.h>


// Dummy value type to test compile. Replace throughout 
#define YOUR_VALUE_T int

// See below where the charmap is
//#define HTABLE_USE_CHARMAP
// Maintain a true linked list that's manageable and iterable
#define HTABLE_MAINTAIN_LIST
// Count lookup misses and such
#define HTABLE_KEEP_STATS
// Fast deletion = faster deletion but more memory consumption
//#define HTABLE_FAST_DELETES

#ifdef HTABLE_USE_CHARMAP
// This is used to quickly collapse the input from full 8-bit to the minimal character set of truly expected data.
// The idea here is to boil down the data. This should only be done if you're confident enough to develop a custom
// hashing algorithm for this particular known range
const char hashing_charmap[256] = {
   // Each data point that is unused (such as control characters or high 8-bit characters)
   // should be 0, while each used character should be represented with a unique sequential value (1, 2, 3, etc)
   // I'm not going to build this for you because it's very custom to your needs.
   // A chunk might look like...
   /*
   0, 0, 0, 0, 17, 18, 19, 0, 0, 20, 21,
   */
};
#endif

static inline uint32_t hash_str(register const char* s, const size_t len) {
   register uint32_t hash = 5381; // hash seed here. This could be different depending on the actual algorithm chosen
   register char symbol;
   // This could be unrolled because we know the string length as well.
   for (register size_t i=0; i < len; i++) {
      #ifdef HTABLE_USE_CHARMAP
      if (!(symbol = hashing_charmap[(unsigned char)s[i]]))
         continue;
      #else
      // Actually s[i] could simply be used (which would be faster) if no mapping is needed.
      symbol = s[i];
      #endif
      // True hash algorithm per-symbol operation here
      /*
      Keep in mind that certain algorithms are optimized for certain things.
      An example:
      Stock DJBX33A is very fast but effectively only represents the end of a long input. It's really meant for short inputs (like variable names)
      A MurmurHash or tuned FNV variant is likely to be a good pick since we've reduced the symbol range and we are dealing with potentially long inputs.
      It's also important to understand that the entire hash will likely not be used. Only the lower-end bits will be used
      (you'll see why in the actual functionality). If your hashing algorithm is good though, this shouldn't matter because
      the distribution should be normal.
      I'll just use Jenkins one-at-a-time hash here (because it's easy)
      */
      hash += symbol;
      hash += (hash << 10);
      hash ^= (hash >> 6);
   }
   // Finalize Jenkins one-at-a-time
   hash += (hash << 3);
   hash ^= (hash >> 11);
   hash += (hash << 15);
   return hash;
}


typedef struct _hash_entry {
   char* key;
   size_t key_len;
   uint32_t hash;
   // Whatever your value type is (likely a pointer to your own record or something)
   YOUR_VALUE_T value;
   // Internal linking maintains order.
   // If you don't need proper order maintenance, you don't need these
   #ifdef HTABLE_MAINTAIN_LIST
   struct _hash_entry* prev;
   struct _hash_entry* next;
   #endif

   #ifdef HTABLE_FAST_DELETES
   struct _hash_entry* bucket_prev;
   #endif
   // This is required for the occasional hash miss
   struct _hash_entry* bucket_next;
} hash_entry_t;


typedef struct _hash_table {
   // Counts
   size_t entry_count;
   uint32_t bucket_count;
   unsigned int growth_num;
   unsigned int growth_den;
   #ifdef HTABLE_KEEP_STATS
   // How many times we missed during lookup
   size_t misses;
   // (entry_count - used_buckets) tells you how many collisions there are (the lower the better)
   uint32_t used_buckets;
   #endif
   // Internal linking. Same conditions as in hash_entry_t so feel free to remove as necessary.
   #ifdef HTABLE_MAINTAIN_LIST
   hash_entry_t* first;
   hash_entry_t* last;
   #endif
   // Buckets, the soul of the hash table
   uint32_t hash_mask;
   hash_entry_t** buckets;
} hash_table_t;



// Creates a hash table
// size_hint - Tells the table how many buckets it should initially allocate.
//    If you know (for example) that you'll have about 500 entries, set it
//    to 500
// growth_num and growth_den - This is the ratio of how many entries to how
//    many buckets that you want to guarantee.
//    It's in two integers to avoid floating point math for speed.
//    The logic after an insertion is...
//       if (entry_count == growth_num * (bucket_count / growth_den)) then
//          grow the bucket array
//    For example, when growth_num is 4 and growth_den is 5...
//       (entry_count == 4 * (bucket_count / 5))
//   ...would be true when entry count is 80% of the bucket count
//    This can result in a greater than 1.0 ratio (such as 5/4 or something
//    like that) if you prefer. This would mean that there are fewer buckets
//    than there are entries, so collisions are guaranteed at that point, but
//    you would save both memory and the cost of frequent bucket expansions
//    (which are costly during an insert operation).
static hash_table_t* htable_create(const size_t size_hint, const unsigned int growth_num, const unsigned int growth_den);
// Frees a hash table
static void htable_free(hash_table_t* table);
// Mostly used internally. You probably want htable_get(), htable_value(), or htable_exists()
static hash_entry_t* htable_find_entry(hash_table_t* table, const char* key, size_t key_len, uint32_t* hash, size_t* true_len);
// Get the pointer to a value stored in the table (or NULL if non-existent)
static YOUR_VALUE_T* htable_value(const hash_table_t* table, const char* key, size_t key_len);
// Get the value of an entry, or the default value if the entry doesn't exist
static YOUR_VALUE_T htable_get(const hash_table_t* table, const char* key, size_t key_len, const YOUR_VALUE_T default_value);
// Test for the existence of a value
static int htable_exists(const hash_table_t* table, const char* key, size_t key_len);
// Add a new entry (but don't update if it already exists). Returns NULL if it already exists
static hash_entry_t* htable_add(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value);
// Update an entry OR add a new entry if it doesn't already exist
static hash_entry_t* htable_set(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value);
// Update an entry but don't add a new entry if it doesn't already exist. Returns NULL if it doesn't exist
static hash_entry_t* htable_update(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value);
// Delete an entry. Returns 1 on success or 0 if the entry didn't exist
static int htable_delete(hash_table_t* table, const char* key, size_t key_len);
// Pack the table.
// This is here because...
// - If HTABLE_FAST_DELETES is set, and if you delete a bunch of entries, it's
//   possible that you can free up some memory by shrinking the bucket array.
//   You would have to call this manually to make that happen.
// - If HTABLE_FAST_DELETES is NOT set however, this gets called automatically
//   on each delete, so the buckets are guaranteed to be packed.
static void htable_pack(hash_table_t* table);



/*********************************\
Implementation...
\*********************************/
static hash_table_t* htable_create(const size_t size_hint, const unsigned int growth_num, const unsigned int growth_den) {
   hash_table_t* res = malloc(sizeof(hash_table_t));
   if (!res)
      return NULL;
   res->entry_count = 0;
   #ifdef HTABLE_MAINTAIN_LIST
   res->first = NULL;
   res->last = NULL;
   #endif

   #ifdef HTABLE_KEEP_STATS
   res->misses = 0; 
   res->used_buckets = 0;
   #endif
   if ((!growth_num) || (!growth_den)) {
      // Grow only when the entry count matches the bucket count
      res->growth_num = 1;
      res->growth_den = 1;
   } else {
      res->growth_num = growth_num;
      res->growth_den = growth_den;
   }
   /*
   For computational speed and simplicity we'll grow the bucket array exponentially.
   Not growing the buckets exponentially is possible but requires a different
   entry lookup mechanism (because hash & hash_mask would no longer work) and would 
   likely involve the modulo operator which is very slow. If memory is uber important
   however, this might be a good solution.
   */
   // We'll go ahead and assume it's a reasonably small table and only allocate 256 buckets.
   int bits = 8;
   if (size_hint) {
      unsigned long target = (size_hint * res->growth_den) / res->growth_num;
      // First check is to prevent overflow as it would be 0 when bits is 31 on a 32 bit system
      while ((1 << (bits + 1)) && ((1 << bits) < target))
         bits++;
   }
   res->bucket_count = 1 << bits;
   res->hash_mask = (1 << bits) - 1;
   if ((res->buckets = (hash_entry_t**)calloc(res->bucket_count, sizeof(hash_entry_t*))) == NULL) {
      free(res);
      return NULL;
   }
   memset(res->buckets, 0, sizeof(hash_entry_t*) * res->bucket_count);
   return res;
}

// Destroy a table
static void htable_free(hash_table_t* table) {
   hash_entry_t* entry;
   hash_entry_t* next;
   #ifdef HTABLE_MAINTAIN_LIST
      entry = table->first;
      while (entry) {
         next = entry->next;
         free(entry->key);
         free(entry);
         entry = next;
      }
   #else
      for (uint32_t i=0; i < table->bucket_count; i++) {
         entry = table->buckets[i];
         while (entry) {
            next = entry->bucket_next;
            free(entry->key);
            free(entry);
            entry = next;
         }
      }
   #endif
   free(table->buckets);
   free(table);
}

// Find an entry: (mostly used internally)
// returns NULL when the entry isn't found
static hash_entry_t* htable_find_entry(hash_table_t* table, const char* key, size_t key_len, uint32_t* hash, size_t* true_len) {
   if (!key_len)
      key_len = strlen(key);
   if (true_len != NULL)
      *true_len = key_len;
   uint32_t h = hash_str(key, key_len);
   if (hash != NULL)
      *hash = h;
   uint32_t bucket = h & table->hash_mask;
   // Best case here is O(1) because table->buckets[bucket] would be the entry
   hash_entry_t* entry = table->buckets[bucket];
   // ... but if we miss, then the time increases to as much as O(n) where n is the number of entries in
   // the particular bucket (good hash + good ratio management means that n would usually be only 1)
   while ((entry) && ((entry->hash != h) || (entry->key_len != key_len) || (memcmp(entry->key, key, key_len)))) {
      #ifdef HTABLE_KEEP_STATS
      table->misses++;
      #endif
      entry = entry->bucket_next;
   }
   return entry;
}


// Insertion of entry into bucket. Used internally
static inline int _htable_bucket_insert(hash_entry_t** buckets, hash_entry_t* entry, const uint32_t hash_mask) {
   hash_entry_t* bentry;
   #ifdef HTABLE_FAST_DELETES
      entry->bucket_prev = NULL;
   #endif
   entry->bucket_next = NULL;
   uint32_t bidx = entry->hash & hash_mask;
   int res = 0;
   if ((bentry = buckets[bidx]) == NULL) {
      res = 1;
      buckets[bidx] = entry;
   } else {
      while (bentry->bucket_next)
         bentry = bentry->bucket_next;
      bentry->bucket_next = entry;
      #ifdef HTABLE_FAST_DELETES
         entry->bucket_prev = bentry;
      #endif
   }
   return res;
}

// Bucket array growing/shrinking. Used internally
static void _htable_adjust_as_needed(hash_table_t* table) {
   int change = (((table->bucket_count << 1) != 0) && (table->entry_count >= table->growth_num * (table->bucket_count / table->growth_den)));
   if (!change) {
      if ((table->bucket_count > (1 << 8)) && (table->entry_count < table->growth_num * ((table->bucket_count >> 1) / table->growth_den))) {
         change = -1;
      } else {
         return;
      }
   }
   uint32_t new_bucket_count = (change < 0) ? table->bucket_count >> 1 : table->bucket_count << 1;
   uint32_t new_hash_mask = new_bucket_count - 1;
   hash_entry_t** new_buckets = (hash_entry_t**)calloc(new_bucket_count, sizeof(hash_entry_t*));
   if (!new_buckets)
      return;
   memset(new_buckets, 0, new_bucket_count * sizeof(hash_entry_t*));
   #ifdef HTABLE_KEEP_STATS
      table->used_buckets = 0;
   #endif
   hash_entry_t* entry;
   #ifdef HTABLE_MAINTAIN_LIST
      entry = table->first;
      while (entry) {
         int r = _htable_bucket_insert(new_buckets, entry, new_hash_mask);
         #ifdef HTABLE_KEEP_STATS
         table->used_buckets += r;
         #endif
         entry = entry->next;
      }
   #else
      hash_entry_t* next;
      for (uint32_t i=0; i < table->bucket_count; i++) {
         entry = table->buckets[i];
         while (entry) {
            next = entry->bucket_next;
            int r = _htable_bucket_insert(new_buckets, entry, new_hash_mask);
            #ifdef HTABLE_KEEP_STATS
            table->used_buckets += r;
            #endif
            entry = next;
         }
      }
   #endif
   free(table->buckets);
   table->buckets = new_buckets;
   table->bucket_count = new_bucket_count;
   table->hash_mask = new_hash_mask;
}


// Get the pointer to the value of the entry or NULL if not in table
static YOUR_VALUE_T* htable_value(const hash_table_t* table, const char* key, size_t key_len) {
   // un-const table so that find_entry can keep statistics
   hash_entry_t* entry = htable_find_entry((hash_table_t*)table, key, key_len, NULL, NULL);
   return (entry != NULL) ? &entry->value : NULL;
}

static YOUR_VALUE_T htable_get(const hash_table_t* table, const char* key, size_t key_len, const YOUR_VALUE_T default_value) {
   // un-const table so that find_entry can keep statistics
   hash_entry_t* entry = htable_find_entry((hash_table_t*)table, key, key_len, NULL, NULL);
   return (entry != NULL) ? entry->value : default_value;
}

static int htable_exists(const hash_table_t* table, const char* key, size_t key_len) {
   // un-const table so that find_entry can keep statistics
   return (htable_find_entry((hash_table_t*)table, key, key_len, NULL, NULL) != NULL);
}

// Add a new entry (but don't update if it already exists)
// Returns NULL if the entry already exists (use set() if you want add or update logic)
static hash_entry_t* htable_add(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value) {
   uint32_t hash;
   hash_entry_t* res = htable_find_entry(table, key, key_len, &hash, &key_len);
   if (res != NULL)
      return NULL;
   if ((res = (hash_entry_t*)malloc(sizeof(hash_entry_t))) == NULL)
      return NULL;
   if ((res->key = (char*)malloc(key_len + 1)) == NULL) {
      free(res);
      return NULL;
   }
   memcpy(res->key, key, key_len + 1);
   res->key_len = key_len;
   res->hash = hash;
   res->value = value;
   #ifdef HTABLE_MAINTAIN_LIST
   res->prev = NULL;
   res->next = NULL;
   if (table->first == NULL) {
      table->first = res;
      table->last = res;
   } else {
      res->prev = table->last;
      table->last->next = res;
      table->last = res;
   }
   #endif
   int r = _htable_bucket_insert(table->buckets, res, table->hash_mask);
   #ifdef HTABLE_KEEP_STATS
      table->used_buckets += r;
   #endif
   table->entry_count++;
   _htable_adjust_as_needed(table);
   return res;
}


static hash_entry_t* htable_set(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value) {
   uint32_t hash;
   hash_entry_t* res = htable_find_entry(table, key, key_len, &hash, &key_len);
   if (res != NULL) {
      res->value = value;
      return res;
   }
   if ((res = (hash_entry_t*)malloc(sizeof(hash_entry_t))) == NULL)
      return NULL;
   if ((res->key = (char*)malloc(key_len + 1)) == NULL) {
      free(res);
      return NULL;
   }
   memcpy(res->key, key, key_len + 1);
   res->key_len = key_len;
   res->hash = hash;
   res->value = value;
   #ifdef HTABLE_MAINTAIN_LIST
   res->prev = NULL;
   res->next = NULL;
   if (table->first == NULL) {
      table->first = res;
      table->last = res;
   } else {
      res->prev = table->last;
      table->last->next = res;
      table->last = res;
   }
   #endif
   int r = _htable_bucket_insert(table->buckets, res, table->hash_mask);
   #ifdef HTABLE_KEEP_STATS
      table->used_buckets += r;
   #endif
   table->entry_count++;
   _htable_adjust_as_needed(table);
   return res;
}

// Update an entry but don't add a new entry if it doesn't already exist. Returns NULL if it doesn't exist
static hash_entry_t* htable_update(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value) {
   hash_entry_t* res = htable_find_entry(table, key, key_len, NULL, NULL);
   if (res == NULL)
      return NULL;
   res->value = value;
   return res;
}

// Delete an entry. Returns 1 on success or 0 if the entry didn't exist
static int htable_delete(hash_table_t* table, const char* key, size_t key_len) {
   uint32_t hash;
   hash_entry_t* entry = htable_find_entry(table, key, key_len, &hash, &key_len);
   if (entry == NULL)
      return 0;

   #ifdef HTABLE_MAINTAIN_LIST
      if (entry == table->first)
         table->first = entry->next;
      if (entry == table->last) {
         table->last = entry->prev;
      }
      if (entry->prev != NULL)
         entry->prev->next = entry->next;
      if (entry->next != NULL)
         entry->next->prev = entry->prev;
   #endif

   uint32_t bucket = hash & table->hash_mask;
   hash_entry_t* bhead = table->buckets[bucket];
   hash_entry_t* bprev = NULL;
   #ifdef HTABLE_FAST_DELETES
      bprev = entry->bucket_prev;
   #else
      if (bhead != entry) {
         bprev = bhead;
         while (bprev->bucket_next != entry)
            bprev = bprev->bucket_next;
      }
   #endif
   if (bprev != NULL)
      bprev->bucket_next = entry->bucket_next;

   #ifdef HTABLE_FAST_DELETES
      if (entry->bucket_next != NULL)
         entry->bucket_next->bucket_prev = entry->bucket_prev;
   #endif

   if (bhead == entry) {
      table->buckets[bucket] = entry->bucket_next;
      #ifdef HTABLE_KEEP_STATS
         if (entry->bucket_next == NULL)
            table->used_buckets--;
      #endif
   }

   free(entry->key);
   free(entry);
   table->entry_count--;

   #ifndef HTABLE_FAST_DELETES
      htable_pack(table);
   #endif
   return 1;
}

static void htable_pack(hash_table_t* table) {
   _htable_adjust_as_needed(table);
}

Usage examples (as assertions) and efficiency tests. Using int as the data value type...

hash_table_t* ht = htable_create(0, 0, 0);
assert(ht != NULL);  // Table was created successfully

// Testing basic adding/updating/getting...
assert(htable_add(ht, "hello-world", 0, 234) != NULL); // hello-world set to 234
assert(htable_add(ht, "goodbye-world", 0, 567) != NULL); // goobye-world set to 567
assert(ht->entry_count == 2); // Two entries exist (hello-world and goodbye-world)
assert(htable_exists(ht, "hello-world", 0) == 1); // hello-world exists
assert(htable_exists(ht, "goodbye-world", 0) == 1); // goodbye-world exists
assert(htable_exists(ht, "unknown-world", 0) == 0); // unknown-world doesn't exist
assert(*htable_value(ht, "hello-world", 0) == 234); // hello-world has a value of 234
assert(*htable_value(ht, "goodbye-world", 0) == 567); // goodbye-world has a value of 567
assert(htable_get(ht, "hello-world", 0, -1) == 234); // hello-world exists and has a value of 234
assert(htable_get(ht, "goodbye-world", 0, -1) == 567); // goobye-world exists and has a value of 567
assert(htable_get(ht, "unknown-world", 0, -1) == -1); // unknown-world does not exist so the default value of -1 is returned
*htable_value(ht, "hello-world", 0) = -1; // hello-world's value is directly set via reference to -1
*htable_value(ht, "goodbye-world", 0) = -2; // goodbye-world's value is directly set via reference to -2
assert(*htable_value(ht, "hello-world", 0) == -1); // hello-world has a value of -1
assert(*htable_value(ht, "goodbye-world", 0) == -2); // goodbye-world has a value of -2
assert(htable_update(ht, "hello-world", 0, 1000) != NULL); // hello-world set to 1000
assert(htable_update(ht, "goodbye-world", 0, 2000) != NULL); // goodbye-world set to 2000
assert(htable_update(ht, "unknown-world", 0, 3000) == NULL); // unknown-world not set (it doesn't exist);
assert(ht->entry_count == 2); // Two entries exist (hello-world and goodbye-world)
assert(htable_set(ht, "hello-world", 0, 1111) != NULL); // hello-world set to 1111
assert(htable_set(ht, "goodbye-world", 0, 2222) != NULL); // goodbye-world set to 2222
assert(htable_set(ht, "unknown-world", 0, 3333) != NULL); // unknown-world added with a value of 3333
assert(ht->entry_count == 3); // Three entries exist (hello-world, goodbye-world, and unknown-world)
printf("%s\n", "After all additions and changes:");
#ifdef HTABLE_MAINTAIN_LIST
// A foreach iteration
hash_entry_t* entry = ht->first;
while (entry != NULL) {
   printf("\"%s\" = %i\n", entry->key, entry->value);
   entry = entry->next;
}
#endif
#ifdef HTABLE_KEEP_STATS
assert(ht->entry_count - ht->used_buckets == 0); // Means that no hash collisions occurred
assert(ht->misses == 0); // Means that each lookup was in O(1) time
#endif

// Testing basic deletion...
assert(htable_delete(ht, "not-a-world", 0) == 0); // not-a-world not deleted (doesn't exist)
assert(htable_delete(ht, "hello-world", 0) == 1); // hello-world deleted
assert(htable_delete(ht, "hello-world", 0) == 0); // hello-world not deleted (doesn't exist)
assert(htable_exists(ht, "hello-world", 0) == 0); // hello-world doesn't exit
assert(htable_exists(ht, "goodbye-world", 0) == 1); // goobye-world still exists
assert(htable_exists(ht, "unknown-world", 0) == 1); // unknown-world still exists
assert(ht->entry_count == 2); // Two entries exists (goodbye-world and unknown-world)
assert(htable_delete(ht, "unknown-world", 0) == 1); // unknown-world deleted
assert(htable_exists(ht, "unknown-world", 0) == 0); // unknown-world doesn't exit
assert(htable_exists(ht, "goodbye-world", 0) == 1); // goodbye-world still exists
assert(ht->entry_count == 1); // One entry exists (goodbye-world)
#ifdef HTABLE_MAINTAIN_LIST
// A foreach iteration
printf("%s\n", "After deletion:");
entry = ht->first;
while (entry != NULL) {
   printf("\"%s\" = %i\n", entry->key, entry->value);
   entry = entry->next;
}
#endif

#ifdef HTABLE_KEEP_STATS
assert(ht->entry_count - ht->used_buckets == 0); // Means that no hash collisions occurred
assert(ht->misses == 0); // Means that each lookup was in O(1) time
#endif

htable_free(ht);
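
For the asker's query-string scenario, the table above could be filled once and queried thereafter roughly as follows. This sketch makes a few assumptions: YOUR_VALUE_T is defined as char*, the stored values point into a writable copy of the query string, and load_query() is a hypothetical helper rather than part of the code above.

#include <stdio.h>

// Parse "x=100&name=bob&..." once into the table. `query` must be writable,
// since strtok() and the '=' split modify it in place.
void load_query(hash_table_t* ht, char* query) {
   char* pair = strtok(query, "&");
   while (pair != NULL) {
      char* eq = strchr(pair, '=');
      if (eq != NULL) {
         *eq = '\0';                        // split "x=100" into "x" and "100"
         htable_add(ht, pair, 0, eq + 1);   // key "x", value "100"
      }
      pair = strtok(NULL, "&");
   }
}

/* Usage:
   char buf[] = "x=100&name=bob&color=red&y=150";
   hash_table_t* ht = htable_create(0, 0, 0);
   load_query(ht, buf);
   printf("%s\n", htable_get(ht, "x", 0, "not found"));   // prints 100
   htable_free(ht);
*/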

Additionally I did some tests using 100,000 randomly generated ASCII keys with lengths between 5 and 1000 characters that showed the following...

  • After random entries using default parameters:
    • Entries: 100000
    • Buckets: 131072
    • Used buckets: 69790
    • Collisions: 30210
    • Misses: 71394
    • Hash/Bucket Efficiency: 69.79%
  • After random entries using a growth ratio of 1/2:
    • Entries: 100000
    • Buckets: 262144
    • Used buckets: 83181
    • Collisions: 16819
    • Misses: 35436
    • Hash/Bucket Efficiency: 83.18%
  • After random entries using a growth ratio of 2/1:
    • Entries: 100000
    • Buckets: 65536
    • Used buckets: 51368
    • Collisions: 48632
    • Misses: 141607
    • Hash/Bucket Efficiency: 51.37%

As you can see, it has the potential to perform quite well. An efficiency of 80% means that approximately 80% of the lookups are O(1), about 16% of the lookups are O(2), about 3.2% of the lookups are O(3), and about 0.8% of lookups are O(4+). This means that on average a lookup would take O(1.248).

Likewise, an efficiency of 50% means that 50% of lookups are O(1), 25% are O(2), 12.5% are O(3), and 12.5% are O(4+).
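
For reference, those averages follow from modelling each additional probe as an independent miss with probability (1 − e), where e is the hash/bucket efficiency. This geometric model is an assumption for illustration, not something measured in the tests above:

% Expected probes per lookup under a geometric miss model with efficiency e
E[\text{probes}] = \sum_{k \ge 1} k \, e \, (1 - e)^{k-1} = \frac{1}{e},
\qquad e = 0.8 \Rightarrow E = 1.25, \qquad e = 0.5 \Rightarrow E = 2.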

You really just need to pick (or write) the right hashing algorithm for your known factors and tweak things for your specific needs.

Notes:

  1. Those assertions/tests worked for me, but the code is not guaranteed to be bug free. It does seem to be quite stable, though. There's probably a bug or two floating around in there.
  2. If you need list management you can easily add things like move(), swap(), sort(), insert(), etc. by managing entry->prev and entry->next
  3. I couldn't include the test code because I seem to have hit the answer size limit.
  4. Neither the hash function nor the final string comparison is included in the time analysis. This would be impossible to analyze without knowing all the statistics about the input. Both should be quite fast though, and the string comparison could be factored out completely if more is known about the input data.
寄居人 2024-12-17 00:03:10

If you know the set of all possible variable names, then it would be possible to use a perfect hash to map the names to numbers.

But each of the hash tables would end up having the same length. For example, if x and y are the names, then the map would always be of length 2.

If perfect(str) turns 'x' and 'y' into 0 and 1, then the function get would be

get(field, table) 
{
   return table[perfect(field)];
}
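
For a key set that small, perfect() itself can simply be hard-coded. The version below is a hypothetical hand-written mapping for the known names "x" and "y", purely to illustrate the idea:

// Hypothetical hand-written perfect hash for the known names "x" and "y":
// with only two keys, the first character alone distinguishes them.
static int perfect(const char* str)
{
   return (str[0] == 'y') ? 1 : 0;   // "x" -> 0, "y" -> 1
}

/* Usage:
   const char* table[2] = { "100", "150" };   // values for x and y
   get("x", table) then returns table[0], i.e. "100"
*/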