当前位置：文江博客话题详情

开放寻址与单独链接

发布于 2024-09-30 13:48:44 字数 108 浏览 12 评论 0原文

当负载因子接近 1 时，哪种 hashmap 碰撞处理方案更好，以确保最小的内存浪费？

我个人认为答案是使用线性探测的开放寻址，因为在发生冲突时它不需要任何额外的存储空间。这是正确的吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

黒涩兲箜 2024-10-07 13:48:44

回答这个问题：当负载因子接近 1 时，哪种哈希图碰撞处理方案更好，以确保最小的内存浪费？

允许高频率的开放寻址/探测因为正如您自己所说，碰撞不需要额外的空间（只是，可能是时间——当然这也是假设散列函数不完美）。

如果您没有指定“负载因子接近 1”或在问题中包含“成本”指标，那么情况将完全不同。

快乐编码。

回复收藏 0 原文

心如荒岛 2024-10-07 13:48:44

如此完整的哈希图将降级为线性搜索，因此您需要将它们保持在 90% 以下。

您关于使用较少内存的开放寻址是正确的，链接将需要每个节点中的指针或偏移字段。

当我需要非常轻量级的哈希表且不会有大量插入时，我创建了一个哈希数组数据结构。为了保持较低的内存使用率，所有数据都嵌入在同一内存块中，以 HashArray 结构开头，然后是两个用于哈希值和哈希值的数组。价值观。 Hasharray 只能与存储在值中的查找键一起使用。

typedef uint16_t HashType;  /* this can be 32bits if needed. */
typedef uint16_t HashSize;  /* this can be made 32bits if large hasharrays are needed. */
struct HashArray {
  HashSize length;        /* hasharray length. */
  HashSize count;         /* number of hash/values pairs contained in the hasharray. */
  uint16_t value_size;    /* size of each value.  (maximum size of value 64Kbytes) */
  /* these last two fields are just for show, they are not defined in the HashArray struct. */
  uint16_t hashs[length]; /* array of hashs for each value, this helps with resolving bucket collision */
  uint8_t  values[length * value_size]; /* array holding all values. */
};
#define hasharray_get_hashs(array) (HashType *)(((uint8_t *)(array)) + sizeof(HashArray))
#define hasharray_get_values(array) ((uint8_t *)(array)) + sizeof(HashArray) + \
                                       ((array)->length * sizeof(HashType))
#define hasharray_get_value(array, idx) (hasharray_get_values(array) + ((idx) * (array)->value_size))

宏 hasharray_get_hashs & hasharray_get_values 用于获取“哈希”和“哈希”。 “值”数组。

我已使用它来添加对已存储在数组中的复杂对象的快速查找。这些对象有一个用于查找的字符串“名称”字段。名称被散列并与对象索引一起插入到散列数组中。存储在 hasharray 中的值可以是索引/指针/整个对象（我只使用小的 16 位索引值）。

如果你想打包 hasharray 直到它几乎满了，那么你将需要使用完整的 32 位哈希而不是上面定义的 16 位哈希。当散列数组超过 90% 满时，较大的 32 位散列将有助于保持快速搜索。

上面定义的 hasharray 最多只能容纳 65535，这很好，因为我从来没有在任何有超过几百个值的东西上使用它。任何需要更多的东西都应该使用普通的哈希表。但如果内存确实是一个问题，HashSize 类型可以更改为 32 位。我还使用 2 的幂长度来保持哈希查找速度快。有些人更喜欢使用质数桶长度，但只有当哈希函数分布不良时才需要这样做。

#define hasharray_empty_hash 0xFFFF /* hash value to mark empty slots. */
void *hasharray_search(HashArray *array, HashType hash, uint32_t *next) {
  HashType *hashs = hasharray_get_hashs(array);
  uint32_t mask = array->length - 1;
  uint32_t start_idx;
  uint32_t idx;

  hash = (hash == hasharray_empty_hash) ? 0 : hash; /* need one hash value to mark empty slots. */
  start_hash_idx = (hash & mask);
  if(*next == 0) {
    idx = start_idx; /* new search. */
  } else {
    idx = *next & mask; /* continuing search to next slot. */
  }

  /* find hash in hash array. */
  do {
    /* check for hash match. */
    if(hashs[idx] == hash) goto found_hash;
    /* check for end of chain. */
    if(hashs[idx] == hasharray_empty_hash) break;
    idx++;
    idx &= mask;
  } while(idx != start_idx);
  /* maximum tries reached (i.e. did a linear search of whole array) or end of chain. */
  return NULL;

found_hash:
  *next = idx + 1; /* where to continue search at, if this is not the right value. */
  return hasharray_get_values(array) + (idx * array->value_size);
}

哈希冲突将会发生，因此调用 hasharray_search() 的代码需要将搜索键与存储在值对象中的键进行比较。如果它们不匹配，则再次调用 hasharray_search()。也可以存在非唯一键，因为搜索可以继续，直到返回“NULL”以查找与一个键匹配的所有值。搜索功能使用线性探测来友好地进行缓存。

typedef struct {
  char *name;   /* this is the lookup key. */
  char *type;
  /* other field info... */
} Field;

typedef struct {
  Field *list;          /* array of Field objects. */
  HashArray *lookup;    /* hasharray for fast lookup of Field objects by name.  The values stored in this hasharray are 16bit indices. */
  uint32_t field_count; /* number of Field objects in 'list'. */
} Fields;

extern Fields *fields_new(uint16_t count) {
  Fields *fields;
  fields = calloc(1, sizeof(Fields));
  fields->list = calloc(count, sizeof(Field));
  /* allocate hasharray to hold at most 'count' uint16_t values.
   * The hasharray will round 'count' up to the next power-of-2.
   * That power-of-2 length must be atleast (count+1), so that there will always be one empty slot.
   */
  fields->lookup = hasharray_new(count, sizeof(uint16_t));
  fields->field_count = count;
}

extern Field *fields_lookup_by_name(Fields *fields, const char *name) {
  HashType hash = str_to_hash(name);
  Field *field;
  uint32_t next = 0;
  uint16_t *rc;
  uint16_t idx;

  do {
    rc = hasharray_search(fields->lookup, hash, &next);
    if(rc == NULL) break; /* field not found. */
    /* found a possible match. */
    idx = *rc;
    assert(idx < fields->field_count);
    field = &(fields->list[idx]);
    /* compare lookup name with field's name. */
    if(strcmp(name, field->name) == 0) {
      /* found match. */
      return field;
    }
    /* field didn't match continue search to next field. */
  } while(1);
  return NULL;
}

如果数组已满 99% 并且键不存在，最坏情况下的搜索将降级为整个数组的线性搜索。如果键是整数，那么线性搜索应该不会太糟糕，而且只有具有相同哈希值的键才需要比较。我尝试保持散列数组的大小，使它们仅占 70-80% 左右，如果值仅为 16 位值，则在空槽上浪费的空间并不多。通过这种设计，当使用 16 位哈希值时，每个空槽仅浪费 4 个字节。 16 位索引值。对象数组（上例中的 Field 结构）没有空位。

另外，我见过的大多数哈希表实现都不存储计算出的哈希值，并且需要完整的键比较来解决存储桶冲突。比较哈希值有很大帮助，因为只有一小部分哈希值用于查找存储桶。

A hashmap that is that full will degrade into a linear search, so you will want to keep them under 90% full.

You are right about open addressing using less memory, chaining will need a pointer or offset field in each node.

I have created a hasharray data structure for when I need very lightweight hashtables that will not have alot of inserts. To keep memory usage low all data is embedded in the same block of memory, with the HashArray structure at the start, then two arrays for hashs & values. Hasharray can only be used with the lookup key is stored in the value.

typedef uint16_t HashType;  /* this can be 32bits if needed. */
typedef uint16_t HashSize;  /* this can be made 32bits if large hasharrays are needed. */
struct HashArray {
  HashSize length;        /* hasharray length. */
  HashSize count;         /* number of hash/values pairs contained in the hasharray. */
  uint16_t value_size;    /* size of each value.  (maximum size of value 64Kbytes) */
  /* these last two fields are just for show, they are not defined in the HashArray struct. */
  uint16_t hashs[length]; /* array of hashs for each value, this helps with resolving bucket collision */
  uint8_t  values[length * value_size]; /* array holding all values. */
};
#define hasharray_get_hashs(array) (HashType *)(((uint8_t *)(array)) + sizeof(HashArray))
#define hasharray_get_values(array) ((uint8_t *)(array)) + sizeof(HashArray) + \
                                       ((array)->length * sizeof(HashType))
#define hasharray_get_value(array, idx) (hasharray_get_values(array) + ((idx) * (array)->value_size))

The macros hasharray_get_hashs & hasharray_get_values are used to get the 'hashs' & 'values' arrays.

I have used this for adding fast lookup of complex objects that are already stored in an array. The objects have a string 'name' field which is used for the lookup. The names are hashed and inserted into the hasharray with the objects index. The values stored in the hasharray can be indexes/pointers/whole objects (I only use small 16bit index values).

If you want to pack the hasharray till it is almost full, then you will want to use full 32bit Hashs instead of the 16bit ones defined above. Larger 32bit hashs will help keep searchs fast when the hasharray is more then 90% full.

The hasharray as defined above can only hold a maximum of 65535, which is fine since I never use it on anything that would have more the a few hundred values. Anything that needs more that that should just use an normal hashtable. But if memory is really an issue, the HashSize type could be changed to 32bits. Also I use power-of-2 lengths to keep the hash lookup fast. Some people prefer to use prime bucket lengths, but that is only needed if the hash function has bad distribution.

#define hasharray_empty_hash 0xFFFF /* hash value to mark empty slots. */
void *hasharray_search(HashArray *array, HashType hash, uint32_t *next) {
  HashType *hashs = hasharray_get_hashs(array);
  uint32_t mask = array->length - 1;
  uint32_t start_idx;
  uint32_t idx;

  hash = (hash == hasharray_empty_hash) ? 0 : hash; /* need one hash value to mark empty slots. */
  start_hash_idx = (hash & mask);
  if(*next == 0) {
    idx = start_idx; /* new search. */
  } else {
    idx = *next & mask; /* continuing search to next slot. */
  }

  /* find hash in hash array. */
  do {
    /* check for hash match. */
    if(hashs[idx] == hash) goto found_hash;
    /* check for end of chain. */
    if(hashs[idx] == hasharray_empty_hash) break;
    idx++;
    idx &= mask;
  } while(idx != start_idx);
  /* maximum tries reached (i.e. did a linear search of whole array) or end of chain. */
  return NULL;

found_hash:
  *next = idx + 1; /* where to continue search at, if this is not the right value. */
  return hasharray_get_values(array) + (idx * array->value_size);
}

hash collisions will happen so the code that calls hasharray_search() needs to compare the search key with the one stored in the value object. If they don't match then hasharray_search() is called again. Also non-unique keys can exist, since searching can continue until 'NULL' is returned to find all values that match one key. The search function uses linear probing to be cache freindly.

typedef struct {
  char *name;   /* this is the lookup key. */
  char *type;
  /* other field info... */
} Field;

typedef struct {
  Field *list;          /* array of Field objects. */
  HashArray *lookup;    /* hasharray for fast lookup of Field objects by name.  The values stored in this hasharray are 16bit indices. */
  uint32_t field_count; /* number of Field objects in 'list'. */
} Fields;

extern Fields *fields_new(uint16_t count) {
  Fields *fields;
  fields = calloc(1, sizeof(Fields));
  fields->list = calloc(count, sizeof(Field));
  /* allocate hasharray to hold at most 'count' uint16_t values.
   * The hasharray will round 'count' up to the next power-of-2.
   * That power-of-2 length must be atleast (count+1), so that there will always be one empty slot.
   */
  fields->lookup = hasharray_new(count, sizeof(uint16_t));
  fields->field_count = count;
}

extern Field *fields_lookup_by_name(Fields *fields, const char *name) {
  HashType hash = str_to_hash(name);
  Field *field;
  uint32_t next = 0;
  uint16_t *rc;
  uint16_t idx;

  do {
    rc = hasharray_search(fields->lookup, hash, &next);
    if(rc == NULL) break; /* field not found. */
    /* found a possible match. */
    idx = *rc;
    assert(idx < fields->field_count);
    field = &(fields->list[idx]);
    /* compare lookup name with field's name. */
    if(strcmp(name, field->name) == 0) {
      /* found match. */
      return field;
    }
    /* field didn't match continue search to next field. */
  } while(1);
  return NULL;
}

The worst case searching will degrade to a linear search of the whole array if it is 99% full and the key doesn't exist. If the keys are integers, then a linear search shouldn't be to bad, also only keys with the same hash value will need to be compared. I try to keep the hasharrays sized so they are only about 70-80% full, the space wasted on empty slots isn't much if the values are only 16bit values. With this design you only waste 4bytes per empty slot when using 16bit hashs & 16bit index values. The array of objects (Field structs in the above example) has no empty spots.

Also most hashtable implementations that I have seen don't store the computed hashs and require full key compares to resolve bucket collisions. Comparing the hashs helps a lot since only a small part of the hash value is used to lookup the bucket.

回复收藏 0 原文

别理我 2024-10-07 13:48:44

正如其他人所说，在线性探测中，当负载因子接近1时，时间复杂度接近线性搜索。（当它已满时，它是无限的。）这里存在内存效率的权衡。虽然隔离链理论上总是给我们恒定的时间。

通常，在线性探测下，建议将负载系数保持在 1/8 至 1/2 之间。当数组已满 1/2 时，我们将其大小调整为原始数组大小的两倍。（参考：算法。作者：Robert Sedgewick. Kevin Wayne。）。删除时，我们还将数组大小调整为原始大小的 1/2。如果你真的感兴趣，那么从我上面提到的那本书开始对你来说是有好处的。
实际上，据说0.72是我们通常使用的经验值。

回复收藏 0 原文

~没有更多了~