前缀匹配搜索最适合的数据结构

发布于 2025-01-20 09:12:31 字数 1027 浏览 0 评论 0原文

我必须创建一个客户列表系统(可以多达1000万个客户),每个客户都有一个唯一的ID,唯一的ID由10个字母组成,前3个是大写字母,后7个是数字(例如:LQK0333208,HCK1646129,...)。系统必须以最快的方式执行两种搜索操作(精确匹配搜索和部分匹配搜索):

  • 对于精确匹配搜索,用户输入完整的客户ID,系统显示匹配客户的详细信息,如果没有则显示错误消息匹配的客户。
  • 对于部分匹配搜索,用户输入多个(至少5个,最多8个)客户ID的起始字母,系统显示匹配客户的详细信息,如果没有匹配客户,则显示错误消息。如果匹配的客户数量大于 10,则仅显示其中的 10 个。

那么什么数据结构适合这个系统呢?目前,我正在使用AVL树来处理问题:

  • 对于精确匹配搜索,我将执行对数搜索(左子树和右子树):O(log(n))。
  • 对于部分匹配搜索,我将执行 AVL 树的中序搜索,并检查每个客户是否具有所需的前缀。这是一个线性搜索:O(n)。

但我希望对于部分匹配搜索,系统将在时间复杂度方面执行更好的搜索。 那么关于数据结构的任何建议是否适合系统的要求?

编辑1:我已经使用 Trie 树和三元搜索树测试了该程序,但适用于更大的数据集,例如(1000 万客户)。我无法将内存中的数据结构存储在具有更大数据集的内存中。那么有什么建议吗?
编辑2:我已经测试了排序数组数据结构,它适用于1000万用户的数据集。实际上,当我对 Trie 或三元树一无所知时,这是我的第一个方法。据我了解,首先我们将所有客户存储在一个数组中,然后使用一些排序算法(例如快速排序)对数组进行排序。然后进行二分查找来查找key,就是O(log(n))来进行查找操作,相当不错!但从长远来看,当我们需要向数组中添加额外的数据(不是创建新数据,而是添加到数组中)时,例如仅增加一个客户,因此添加新元素最坏情况下将花费 O(n)情况下,我们需要找到在哪里添加和移动元素。 但对于 Trie 或三叉树这样的数据结构,当添加新元素时,可能只需要 O(1),因为我们只需要遍历树来查找字符串。如果我们不介意空间复杂度,我认为特里树或三叉树最适合这个项目。

I have to create a system of customer list (can be as large as 10million customers), each customer will have a unique ID and a unique ID consists of 10 letters, the first 3 are upper case letters and the last 7 are digits (ex: LQK0333208, HCK1646129,...). The system must perform two search operations in a fastest way (exact matching search and partial matching search):

  • For the exact matching search, users enter a complete Customer ID, and system displays details of the matching customer or an error message if there is no matching customer.
  • For the partial matching search, users enter several (at least 5 and at most 8) starting letters of Customer ID, and system displays details of the matching customers or an error message if there is no matching customer. If the number of matching customers is greater than 10, display only 10 of them.

So what the suitable data structure for this system? Currently, I am using AVL tree to handle the problem:

  • For exact matching search, I will perform a logarithmic search (left and right subtree): O(log(n)).
  • For partial matching search, I will perform a inorder search of the AVL Tree and check if each customer have the demanded prefix. This is a linear search: O(n).

But I want for partial matching search, the system will perform a better search in term of time complexity.
So any suggestion about the data structure is suitable for the system's requirement?

EDIT 1: I have tested the program with Trie Tree and Ternary Search Tree, but for larger dataset like (10 milions customer). There is no way I could store that in-memory data structure in the memory with a larger dataset like that. So any suggestions?
EDIT 2: I have tested the sorted array data structure and It works well with the data set of 10 million users. Actually, this was my first approach when I did not know anything about the Trie or Ternary tree. As far as I understand, first we will store all the customer in an array, then use some sort algorithms like quicksort to sort the array. Then perform binary search to search for the key, which is O(log(n)) to perform the search operation, quite good! But for a long term, when we need to add extra data to the array (not create the new one, but add to the array), for instance just one more customer, so adding the new element will take O(n) in worst case, as we need to find where to add and shift the element.
But for data structure like Trie or Ternary tree, when adding the new element, it might just require O(1) as we just need to traversal the tree to find the string. If we don't mind about the space complexity, I think trie or ternary tree are suit best for this project.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

却一份温柔 2025-01-27 09:12:31

适合此的数据结构是a trie 。这是所有前缀的树,其中每个节点(根除外)代表一个字符,并且从根到叶子的每个可能路径都将是与有效ID相对应的字符序列。

部分匹配意味着从根部以内部节点结尾的路径。

如果通过有效的儿童查找实现,则可以在此特定用例中以10个步骤找到匹配。因此,如果我们认为10是一个常数,则可以在恒定时间内完成匹配,而不论该树的大小(即)。这假设可以通过恒定时间(平均而言)来抬头看孩子的角色。

与此特定用例一​​样,字母表受到限制(仅大写或仅数字),节点最多可以具有26个儿童条目,可以将其存储在该大小的数组中,其中索引映射到相应的字符。这将确保从父节点到相关子节点的持续时间。另外,也可以使用哈希系统(而不是带有26个插槽的数组)。

这是JavaScript中的演示实现(使用儿童的普通对象,即“词典”):

class TrieNode {
    constructor(data=null) {
        this.children = {}; // Dictionary, <character, TrieNode>
        this.data = data; // Non-null when this node represents the end of a valid word
    }
    addWord(word, data) {
        let node = this; // the root of the tree
        for (let ch of word) {
            if (!(ch in node.children)) {
                node.children[ch] = new TrieNode(); 
            }
            node = node.children[ch]; // Walk down the tree
        }
        node.data = data;
    }
    *getAllData() { // This method returns an iterator over all data in this subtree
        if (this.data != null) yield this.data;
        // Recursively yield all data in the children's subtrees
        for (const child in this.children) yield* this.children[child].getAllData();
    }
    *find(prefix) { // This method returns an iterator over matches
        let node = this;
        // Find the node where this prefix ends:
        for (let ch of prefix) {
            if (!(ch in node.children)) return; // No matches
            node = node.children[ch];
        }
        // Yield all data in this subtree
        yield* node.getAllData();
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

// Demo
// Create some Customer data:
const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

// Build a trie for the database of customers
const trie = new TrieNode(); // The root node of the trie.
for (const customer of database) {
    trie.addWord(customer.id, customer);
}
// Make a few queries
console.log("query: LQK0333");
for (const customer of trie.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of trie.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of trie.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of trie.find("LQK09")) console.log("found: " + customer);

排序的阵列

另一种方法是将客户记录存储在排序的数组中。 JavaScript没有这样的数据结构,但是splice在JavaScript中出人意料地快,因此您可以通过将新条目插入其排序位置来维护排序的顺序。二进制搜索可用于找到索引在哪里查找或插入条目:

class SortedArray {
    constructor(keyField) {
        this.arr = [];
        this.keyField = keyField;
    }
    addObject(obj) {
        const i = this.indexOf(obj[this.keyField]);
        if (this.arr[i]?.[this.keyField] === obj[this.keyField]) throw "Duplicate not added";
        this.arr.splice(i, 0, obj);
    }
    *find(prefix) { // This method returns an iterator over matches
        for (let i = this.indexOf(prefix); i < this.arr.length; i++) {
            const obj = this.arr[i];
            if (!obj[this.keyField].startsWith(prefix)) return;
            yield obj;
        }
    }
    indexOf(key) {
        let low = 0, high = this.arr.length;
        while (low < high) {
            const mid = (low + high) >> 1;
            if (key === this.arr[mid][this.keyField]) return mid;
            if (key > this.arr[mid][this.keyField]) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

const arr = new SortedArray("id");
for (const customer of database) {
    arr.addObject(customer);
}
console.log("query: LQK0333");
for (const customer of arr.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of arr.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of arr.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of arr.find("LQK09")) console.log("found: " + customer);

A suitable data structure for this is a trie. This is a tree of all prefixes, where each node (except the root) represents a character, and each possible path from root to a leaf will be a character sequence that corresponds to a valid ID.

A partial match means that there is a path from the root that ends in an internal node.

If implemented with an efficient child lookup, a match can in this particular use case be found in 10 steps. So if we consider 10 to be a constant, the match can be done in constant time, irrespective of how large (i.e. how wide) the tree is. This assumes that looking up a child by its character can be done in constant time (on average).

As in this particular use case the alphabet is limited (upper case only or digit only), a node can have at most 26 child entries, which could be stored in an array of that size, where the indexes map to the corresponding character. This will ensure constant time for stepping from a parent node to the relevant child node. Alternatively a hashing system can also be used (instead of an array with 26 slots).

Here is a demo implementation in JavaScript (using a plain object for the children, i.e. a "dictionary"):

class TrieNode {
    constructor(data=null) {
        this.children = {}; // Dictionary, <character, TrieNode>
        this.data = data; // Non-null when this node represents the end of a valid word
    }
    addWord(word, data) {
        let node = this; // the root of the tree
        for (let ch of word) {
            if (!(ch in node.children)) {
                node.children[ch] = new TrieNode(); 
            }
            node = node.children[ch]; // Walk down the tree
        }
        node.data = data;
    }
    *getAllData() { // This method returns an iterator over all data in this subtree
        if (this.data != null) yield this.data;
        // Recursively yield all data in the children's subtrees
        for (const child in this.children) yield* this.children[child].getAllData();
    }
    *find(prefix) { // This method returns an iterator over matches
        let node = this;
        // Find the node where this prefix ends:
        for (let ch of prefix) {
            if (!(ch in node.children)) return; // No matches
            node = node.children[ch];
        }
        // Yield all data in this subtree
        yield* node.getAllData();
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

// Demo
// Create some Customer data:
const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

// Build a trie for the database of customers
const trie = new TrieNode(); // The root node of the trie.
for (const customer of database) {
    trie.addWord(customer.id, customer);
}
// Make a few queries
console.log("query: LQK0333");
for (const customer of trie.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of trie.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of trie.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of trie.find("LQK09")) console.log("found: " + customer);

Sorted Array

Another approach is to store the Customer records in a sorted array. JavaScript has no such data structure, but splice is surprisingly fast in JavaScript, so you could just maintain a sorted order by inserting new entries in their sorted position. Binary search can be used to locate the index where to find or insert an entry:

class SortedArray {
    constructor(keyField) {
        this.arr = [];
        this.keyField = keyField;
    }
    addObject(obj) {
        const i = this.indexOf(obj[this.keyField]);
        if (this.arr[i]?.[this.keyField] === obj[this.keyField]) throw "Duplicate not added";
        this.arr.splice(i, 0, obj);
    }
    *find(prefix) { // This method returns an iterator over matches
        for (let i = this.indexOf(prefix); i < this.arr.length; i++) {
            const obj = this.arr[i];
            if (!obj[this.keyField].startsWith(prefix)) return;
            yield obj;
        }
    }
    indexOf(key) {
        let low = 0, high = this.arr.length;
        while (low < high) {
            const mid = (low + high) >> 1;
            if (key === this.arr[mid][this.keyField]) return mid;
            if (key > this.arr[mid][this.keyField]) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

const arr = new SortedArray("id");
for (const customer of database) {
    arr.addObject(customer);
}
console.log("query: LQK0333");
for (const customer of arr.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of arr.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of arr.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of arr.find("LQK09")) console.log("found: " + customer);

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文