大数据集哈希和 C 实现

发布于 2024-12-15 09:40:18 字数 477 浏览 2 评论 0原文

我有大量范围从 0 - 5463458053 的值。对于每个值,我希望映射一个包含字符串的集合,以便操作查找,即查找该集合中是否存在字符串最短的时间。请注意,这组值可能不包含 (0 - 5463458053) 中的所有值,但确实包含大量值。

我当前的解决方案是对这些值(0 - 5463458053 之间)进行哈希处理,并为每个值提供与该值对应的字符串链接列表。每次,我想检查给定集合中的字符串,我都会对值(0 - 5463458053 之间)进行哈希处理,获取链表,然后遍历它以找出它是否包含上述字符串。

虽然这看起来更容易,但有点耗时。你能想出更快的解决方案吗?此外,碰撞将会是可怕的。它们会导致错误的结果。

另一部分是关于在 C 中实现这一点。我将如何去做呢?

注意:有人建议使用数据库。我想知道这是否有用。

我有点担心 RAM 自然会耗尽。 :-)

I have a large number of values ranging from 0 - 5463458053. To each value, I wish to map a set containing strings so that the operation lookup, i. e. finding whether a string is present in that set takes the least amount of time. Note that this set of values may not contain all values from (0 - 5463458053), but yes, a large number of them.

My current solution is to hash those values (between 0 - 5463458053) and for each value, have a linked list of strings corresponding to that value. Every time, I want to check for a string in a given set, I hash the value(between 0 - 5463458053), get the linked list, and traverse it to find out whether it contains the aforementioned string or not.

While this might seem easier, it's a little time consuming. Can you think of a faster solution? Also, collisions will be dreadful. They'll lead to wrong results.

The other part is about implementing this in C. How would I go about doing this?

NOTE: Someone suggested using a database instead. I wonder if that'll be useful.

I'm a little worried about running out of RAM naturally. :-)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

你是暖光i 2024-12-22 09:40:18

您可以有一个哈希集的哈希表。第一个哈希表包含整数的键。它里面的值是哈希集,即键是字符串的哈希表。

您还可以拥有一个散列集,其中键是整数和字符串对。

有许多库实现了此类数据结构(在 C++ 中,标准库正在实现它们,如 std::mapstd::set)。对于 C,我想到了 GTK 的 Glib

使用散列技术,内存使用与所考虑的集合(或关系)的大小成正比。例如,您可以接受 30% 的空置率。

You could have an hash-table of hash-sets. The first hash-table has keys your integers. The values inside it are hash-sets, i.e. hash-tables whose keys are strings.

You could also have an hashed set, with the keys being pairs of integers and strings.

There are many libraries implementing such data structures (and in C++, the standard library is implementing them, as std::map & std::set). For C, I was thinking of Glib from GTK.

With hashing techniques, memory use is proportional to the size of the considered sets (or relations). For instance, you could accept 30% emptiness rate.

无法回应 2024-12-22 09:40:18

Judy Array 以及实现它的 C 库可能正是您所需要的基础。这是描述它的引用:

Judy 是一个 C 库,提供最先进的核心技术
实现稀疏动态数组。 Judy 数组已声明
只需使用一个空指针即可。 Judy 数组仅在以下情况下才会消耗内存:
已填充,但可以增长以利用所有可用内存
如果需要的话。 Judy 的主要优势是可扩展性、高性能和
记忆效率。 Judy 数组是可扩展的,可以扩展到
非常大量的元素,仅受机器内存限制。自从
Judy被设计为无界数组,Judy数组的大小为
不是预先分配的,而是随数组动态增长和收缩
人口。 Judy 将可扩展性与易用性结合在一起。朱迪 API
通过简单的插入、检索和删除调用进行访问,而这些调用不需要
需要大量的编程。不需要调整和配置
(事实上​​根本不可能)。此外,还可以进行排序、搜索、计数等
顺序访问功能内置于 Judy 的设计中。

每当开发人员需要动态调整大小的数组时,都可以使用 Judy,
关联数组或简单易用的界面,不需要
进行扩展或收缩的返工。

Judy可以替换很多常见的数据结构,比如数组,稀疏
数组、哈希表、B 树、二叉树、线性列表、跳跃列表、
其他排序和搜索算法以及计数函数。

A Judy Array, with the C library that implements it, might be exactly the base of what you need. Here's a quote that describes it:

Judy is a C library that provides a state-of-the-art core technology
that implements a sparse dynamic array. Judy arrays are declared
simply with a null pointer. A Judy array consumes memory only when it
is populated, yet can grow to take advantage of all available memory
if desired. Judy's key benefits are scalability, high performance, and
memory efficiency. A Judy array is extensible and can scale up to a
very large number of elements, bounded only by machine memory. Since
Judy is designed as an unbounded array, the size of a Judy array is
not pre-allocated but grows and shrinks dynamically with the array
population. Judy combines scalability with ease of use. The Judy API
is accessed with simple insert, retrieve, and delete calls that do not
require extensive programming. Tuning and configuring are not required
(in fact not even possible). In addition, sort, search, count, and
sequential access capabilities are built into Judy's design.

Judy can be used whenever a developer needs dynamically sized arrays,
associative arrays or a simple-to-use interface that requires no
rework for expansion or contraction.

Judy can replace many common data structures, such as arrays, sparse
arrays, hash tables, B-trees, binary trees, linear lists, skiplists,
other sort and search algorithms, and counting functions.

超可爱的懒熊 2024-12-22 09:40:18

大量字符串+快速查找+有限内存---->你想要一个前缀特里树、临界位树或该家族的任何东西(非常相似的东西有许多不同的名称,例如PATRICIA...Judy也是这样的东西)。例如,请参阅

这些数据结构允许前缀压缩,因此它们能够非常有效地存储大量字符串(以某种方式必然具有公共前缀)。而且,查找速度非常快。由于常见的 big-O 表示法没有考虑到缓存和分页效应,它们可以与哈希一样快甚至更快,但只占用一小部分内存(即使根据 big-O,除了可能是一个数组之外什么都没有)可以击败哈希)。

Large number of strings + fast lookup + limited memory ----> you want a prefix trie, crit-bit tree, or anything of that family (many different names for very similar things, e.g. PATRICIA... Judy is one such thing too). See for example this.

These data structores allow for prefix-compression, so they are able to store a lot of strings (which somehow necessarily will have common prefixes) very efficiently. Also, lookup is very fast. Due to caching and paging effects that the common big-O notation does not account for, they can be as fast or even faster than a hash, at a fraction of the memory (even though according to big-O, nothing except maybe an array can beat a hash).

南汐寒笙箫 2024-12-22 09:40:18

您可以使用单个二叉搜索树(AVL/红黑/...)来包含所有集合中的所有字符串,方法是按字典顺序将它们键入为 (set_number, string)。您不需要在任何地方显式存储集合。例如,定义树的节点顺序的比较器可能如下所示:

function compare_nodes (node1, node2) {
    if (node1.set_number < node2.set_number) return LESS;
    if (node1.set_number > node2.set_number) return GREATER;
    if (node1.string < node2.string) return LESS;
    if (node1.string > node2.string) return GREATER;
    return EQUAL;
}

使用这样的结构,可以进行一些常见操作(但可能并不简单)。

要查找字符串 s 是否存在于集合 set_number 中,只需在树中查找 (set_number, s) ,进行精确匹配。

查找集合set_number中的所有字符串:

function iterate_all_strings_in_set (set_number) {
    // Traverse the tree from root downwards, looking for the given key. Return
    // wherever the search ends up, whether it found the value or not.
    node = lookup_tree_weak(set_number, "");

    // tree empty?
    if (node == null) {
        return;
    }

    // We may have gotten the greatest node from the previous set,
    // instead of the first node from the set we're interested in.
    if (node.set_number != set_number) {
        node = successor(node);
    }

    while (node != null && node.set_number == set_number) {
        do_something_with(node.string);
        node = successor(node);
    }
}

上面需要O((k+1)*log(n))时间,其中<​​code>k是set_number中的字符串数量,n是所有字符串的数量。

要查找至少与一个字符串关联的所有集合数字:

function iterate_all_sets ()
{
    node = first_node_in_tree();

    while (node != null) {
        current_set = node.set_number;
        do_something_with(current_set);

        if (cannot increment current_set) {
            return;
        }

        node = lookup_tree_weak(current_set + 1, "");
        if (node.set_number == current_set) {
            node = successor(node);
        }
    }
}

上述需要 O((k+1)*log(n)) 时间,其中 k 是集合至少有一个字符串,n 是所有字符串的数量。

请注意,上面的代码假设树在“do_something”调用中没有被修改;如果删除节点,它可能会崩溃。

另外,这里有一些真实的 C 代码演示了这一点,使用 my自己的通用 AVL 树实现。要编译它,只需从 BadVPN 源代码中复制 misc/struction/ 文件夹并在其中添加包含路径即可。

请注意我的 AVL 树在其节点中不包含任何“数据”,以及它如何不执行任何自己的内存分配。当您有大量数据需要处理时,这会很方便。明确一点:下面的程序只执行一个 malloc(),即分配节点数组的函数。

#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <assert.h>

#include <structure/BAVL.h>
#include <misc/offset.h>

struct value {
    uint32_t set_no;
    char str[3];
};

struct node {
    uint8_t is_used;
    struct value val;
    BAVLNode tree_node;
};

BAVL tree;

static int value_comparator (void *unused, void *vv1, void *vv2)
{
    struct value *v1 = vv1;
    struct value *v2 = vv2;

    if (v1->set_no < v2->set_no) {
        return -1;
    }
    if (v1->set_no > v2->set_no) {
        return 1;
    }

    int c = strcmp(v1->str, v2->str);
    if (c < 0) {
        return -1;
    }
    if (c > 0) {
        return 1;
    }
    return 0;
}

static void random_bytes (unsigned char *out, size_t n)
{
    while (n > 0) {
        *out = rand();
        out++;
        n--;
    }
}

static void random_value (struct value *out)
{
    random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no));

    for (size_t i = 0; i < sizeof(out->str) - 1; i++) {
        out->str[i] = (uint8_t)32 + (rand() % 94);
    }
    out->str[sizeof(out->str) - 1] = '\0';
}

static struct node * find_node (const struct value *val)
{
    // find AVL tree node with an equal value
    BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val);
    if (!tn) {
        return NULL;
    }

    // get node pointer from pointer to its value (same as container_of() in Linux kernel)
    struct node *n = UPPER_OBJECT(tn, struct node, tree_node);
    assert(n->val.set_no == val->set_no);
    assert(!strcmp(n->val.str, val->str));

    return n;
}

static struct node * lookup_weak (const struct value *v)
{
    BAVLNode *tn = BAVL_Lookup(&tree, (void *)v);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * first_node (void)
{
    BAVLNode *tn = BAVL_GetFirst(&tree);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * next_node (struct node *node)
{
    BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

size_t num_found;

static void iterate_all_strings_in_set (uint32_t set_no)
{
    struct value v;
    v.set_no = set_no;
    v.str[0] = '\0';
    struct node *n = lookup_weak(&v);

    if (!n) {
        return;
    }

    if (n->val.set_no != set_no) {
        n = next_node(n);
    }

    while (n && n->val.set_no == set_no) {
        num_found++; // "do_something_with_string"
        n = next_node(n);
    }
}

static void iterate_all_sets (void)
{
    struct node *node = first_node();

    while (node) {
        uint32_t current_set = node->val.set_no;
        iterate_all_strings_in_set(current_set); // "do_something_with_set"

        if (current_set == UINT32_MAX) {
            return;
        }

        struct value v;
        v.set_no = current_set + 1;
        v.str[0] = '\0';
        node = lookup_weak(&v);

        if (node->val.set_no == current_set) {
            node = next_node(node);
        }
    }
}

int main (int argc, char *argv[])
{
    size_t num_nodes = 10000000;

    // init AVL tree, using:
    //   key=(struct node).val,
    //   comparator=value_comparator
    BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL);

    printf("Allocating...\n");

    // allocate nodes (missing overflow check...)
    struct node *nodes = malloc(num_nodes * sizeof(nodes[0]));
    if (!nodes) {
        printf("malloc failed!\n");
        return 1;
    }

    printf("Inserting %zu nodes...\n", num_nodes);

    size_t num_inserted = 0;

    // insert nodes, giving them random values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];

        // choose random set number and string
        random_value(&n->val);

        // try inserting into AVL tree
        if (!BAVL_Insert(&tree, &n->tree_node, NULL)) {
            printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str);
            n->is_used = 0;
            continue;
        }

        n->is_used = 1;
        num_inserted++;
    }

    printf("Looking up...\n");

    // lookup all those values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        struct node *lookup_n = find_node(&n->val);

        if (n->is_used) { // this node is the only one with this value

            ASSERT(lookup_n == n)
        } else { // this node was an insert collision; some other
                 // node must have this value
            ASSERT(lookup_n != NULL) 
            ASSERT(lookup_n != n)
        }
    }

    printf("Iterating by sets...\n");

    num_found = 0;
    iterate_all_sets();
    ASSERT(num_found == num_inserted)

    printf("Removing all strings...\n");

    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        if (!n->is_used) { // must not remove it it wasn't inserted
            continue;
        }
        BAVL_Remove(&tree, &n->tree_node);
    }

    return 0;
}

You can use a single binary search tree (AVL/Red-black/...) to contain all the strings, from all sets, by keying them lexicographically as (set_number, string). You don't need to store sets explicitly anywhere. For example, the comparator defining the order of nodes for the tree could look like:

function compare_nodes (node1, node2) {
    if (node1.set_number < node2.set_number) return LESS;
    if (node1.set_number > node2.set_number) return GREATER;
    if (node1.string < node2.string) return LESS;
    if (node1.string > node2.string) return GREATER;
    return EQUAL;
}

With such a structure, some common operations are possible (but maybe not straightforward).

To find whether a string s exists in the set set_number, simply lookup (set_number, s) in the tree, for an exact match.

To find all strings in the set set_number:

function iterate_all_strings_in_set (set_number) {
    // Traverse the tree from root downwards, looking for the given key. Return
    // wherever the search ends up, whether it found the value or not.
    node = lookup_tree_weak(set_number, "");

    // tree empty?
    if (node == null) {
        return;
    }

    // We may have gotten the greatest node from the previous set,
    // instead of the first node from the set we're interested in.
    if (node.set_number != set_number) {
        node = successor(node);
    }

    while (node != null && node.set_number == set_number) {
        do_something_with(node.string);
        node = successor(node);
    }
}

The above requires O((k+1)*log(n)) time, where k is the number of strings in set_number, and n is the number of all strings.

To find all set numbers with at least one string associated:

function iterate_all_sets ()
{
    node = first_node_in_tree();

    while (node != null) {
        current_set = node.set_number;
        do_something_with(current_set);

        if (cannot increment current_set) {
            return;
        }

        node = lookup_tree_weak(current_set + 1, "");
        if (node.set_number == current_set) {
            node = successor(node);
        }
    }
}

The above requires O((k+1)*log(n)) time, where k is the number of sets with at least one string, and n is the number of all strings.

Note that the above code assumes that the tree is not modified in the "do_something" calls; it may crash if nodes are removed.

Addidionally, here's some real C code which demonstrates this, using my own generic AVL tree implemetation. To compile it, it's enough to copy the misc/ and structure/ folders from BadVPN source somewhere and add an include path there.

Note how my AVL tree does not contain any "data" in its nodes, and how it doesn't do any of its own memory allocation. This comes handy when you have a lot of data to work with. To make it clear: the program below does only a single malloc(), which is the one that allocates the nodes array.

#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <assert.h>

#include <structure/BAVL.h>
#include <misc/offset.h>

struct value {
    uint32_t set_no;
    char str[3];
};

struct node {
    uint8_t is_used;
    struct value val;
    BAVLNode tree_node;
};

BAVL tree;

static int value_comparator (void *unused, void *vv1, void *vv2)
{
    struct value *v1 = vv1;
    struct value *v2 = vv2;

    if (v1->set_no < v2->set_no) {
        return -1;
    }
    if (v1->set_no > v2->set_no) {
        return 1;
    }

    int c = strcmp(v1->str, v2->str);
    if (c < 0) {
        return -1;
    }
    if (c > 0) {
        return 1;
    }
    return 0;
}

static void random_bytes (unsigned char *out, size_t n)
{
    while (n > 0) {
        *out = rand();
        out++;
        n--;
    }
}

static void random_value (struct value *out)
{
    random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no));

    for (size_t i = 0; i < sizeof(out->str) - 1; i++) {
        out->str[i] = (uint8_t)32 + (rand() % 94);
    }
    out->str[sizeof(out->str) - 1] = '\0';
}

static struct node * find_node (const struct value *val)
{
    // find AVL tree node with an equal value
    BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val);
    if (!tn) {
        return NULL;
    }

    // get node pointer from pointer to its value (same as container_of() in Linux kernel)
    struct node *n = UPPER_OBJECT(tn, struct node, tree_node);
    assert(n->val.set_no == val->set_no);
    assert(!strcmp(n->val.str, val->str));

    return n;
}

static struct node * lookup_weak (const struct value *v)
{
    BAVLNode *tn = BAVL_Lookup(&tree, (void *)v);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * first_node (void)
{
    BAVLNode *tn = BAVL_GetFirst(&tree);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * next_node (struct node *node)
{
    BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node);
    if (!tn) {
        return NULL;
    }

    return UPPER_OBJECT(tn, struct node, tree_node);
}

size_t num_found;

static void iterate_all_strings_in_set (uint32_t set_no)
{
    struct value v;
    v.set_no = set_no;
    v.str[0] = '\0';
    struct node *n = lookup_weak(&v);

    if (!n) {
        return;
    }

    if (n->val.set_no != set_no) {
        n = next_node(n);
    }

    while (n && n->val.set_no == set_no) {
        num_found++; // "do_something_with_string"
        n = next_node(n);
    }
}

static void iterate_all_sets (void)
{
    struct node *node = first_node();

    while (node) {
        uint32_t current_set = node->val.set_no;
        iterate_all_strings_in_set(current_set); // "do_something_with_set"

        if (current_set == UINT32_MAX) {
            return;
        }

        struct value v;
        v.set_no = current_set + 1;
        v.str[0] = '\0';
        node = lookup_weak(&v);

        if (node->val.set_no == current_set) {
            node = next_node(node);
        }
    }
}

int main (int argc, char *argv[])
{
    size_t num_nodes = 10000000;

    // init AVL tree, using:
    //   key=(struct node).val,
    //   comparator=value_comparator
    BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL);

    printf("Allocating...\n");

    // allocate nodes (missing overflow check...)
    struct node *nodes = malloc(num_nodes * sizeof(nodes[0]));
    if (!nodes) {
        printf("malloc failed!\n");
        return 1;
    }

    printf("Inserting %zu nodes...\n", num_nodes);

    size_t num_inserted = 0;

    // insert nodes, giving them random values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];

        // choose random set number and string
        random_value(&n->val);

        // try inserting into AVL tree
        if (!BAVL_Insert(&tree, &n->tree_node, NULL)) {
            printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str);
            n->is_used = 0;
            continue;
        }

        n->is_used = 1;
        num_inserted++;
    }

    printf("Looking up...\n");

    // lookup all those values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        struct node *lookup_n = find_node(&n->val);

        if (n->is_used) { // this node is the only one with this value

            ASSERT(lookup_n == n)
        } else { // this node was an insert collision; some other
                 // node must have this value
            ASSERT(lookup_n != NULL) 
            ASSERT(lookup_n != n)
        }
    }

    printf("Iterating by sets...\n");

    num_found = 0;
    iterate_all_sets();
    ASSERT(num_found == num_inserted)

    printf("Removing all strings...\n");

    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        if (!n->is_used) { // must not remove it it wasn't inserted
            continue;
        }
        BAVL_Remove(&tree, &n->tree_node);
    }

    return 0;
}
︶ ̄淡然 2024-12-22 09:40:18

如果条目从 0 到 N 并且连续:使用数组。 (索引对您来说足够快吗?)

编辑:数字似乎不连续。有大量的{key,value}对,其中键是一个很大的数字(>32位但<64位),而值是一堆字符串。

如果内存可用,哈希表很容易,如果字符串不是太大,您可以按顺序检查它们。如果相同的字符串出现(多次)多次,您可以枚举这些字符串(将指向它们的指针放入 char * array[] 中,并使用该数组的索引。查找给定字符串的索引可能涉及另一个哈希表)

对于“主”哈希表,条目可能是:

struct entry {
  struct entry *next; /* for overflow chain */
  unsigned long long key; /* the 33bits number */
  struct list *payload;
  } entries[big_enough_for_all] ; /* if size is known in advance
                          , preallocation avoids a lot of malloc overhead */

如果您有足够的内存来存储头数组,那么您当然应该这样做:

struct entry *heads[SOME_SIZE] = {NULL, };

,否则您可以将头数组与条目数组组合起来。 已知的整数键集所做的那样)

(就像我在此处查找 碰撞很容易:当您遍历溢出链时,只需将您的密钥与条目中的密钥进行比较即可。如果它们不相等:继续前进。如果它们相等:找到;现在去抚弦吧。

If the entries are from 0 to N and consecutive: use an array. (Is indexing fast enough for you?)

EDIT: the numbers do not seem to be consecutive. There is a large number of {key,value} pairs, where the key is a big number (>32 bits but < 64 bits) and the value is a bunch of strings.

If memory is available, a hash table is easy, if the bunch of strings is not too large you can inspect them sequentially. If the same strings occur (much) more than once, you could enumerate the strings (put pointers to them in a char * array[] and use the index into that array instead. finding the index given a string probably involves another hash table)

For the "master" hashtable an entry would probably be:

struct entry {
  struct entry *next; /* for overflow chain */
  unsigned long long key; /* the 33bits number */
  struct list *payload;
  } entries[big_enough_for_all] ; /* if size is known in advance
                          , preallocation avoids a lot of malloc overhead */

if you have enough memory to store a heads-array, you chould certainly do that:

struct entry *heads[SOME_SIZE] = {NULL, };

, otherwise you can combine the heads array with the array of entries. (like I did Lookups on known set of integer keys here)

Handling collisions is easy: as you walk the overflow chain, just compare your key with the key in the entry. If they are unequal: walk on. If they are equal: found; now go walking the strings.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文