节省空间的特里树

发布于 2024-11-16 10:48:53 字数 619 浏览 4 评论 0原文

我正在尝试在 C 中实现一个节省空间的 trie。这是我的结构:

struct node {
char val; //character stored in node
int key; //key value if this character is an end of word
struct node* children[256];
};

当我添加一个节点时,它的索引是字符的 unsigned char 转换。例如,如果我想添加“c”,则

children[(unsigned char)'c']

是指向新添加的节点的指针。但是,此实现要求我声明一个包含 256 个元素的 node* 数组。我想要做的是:

struct node** children;

然后在添加节点时,只需为该节点分配空间并

children[(unsigned char)'c']

指向新节点。问题是,如果我不首先为子级分配空间,那么我显然无法引用任何索引,否则这是一个很大的错误。

所以我的问题是:如何实现一个 trie,使其仅存储指向其子级的非空指针?

I'm trying to implement a space efficient trie in C. This is my struct:

struct node {
char val; //character stored in node
int key; //key value if this character is an end of word
struct node* children[256];
};

When I add a node, it's index is the unsigned char cast of the character. For example, if I want to add "c", then

children[(unsigned char)'c']

is the pointer to the newly added node. However, this implementation requires me to declare a node* array of 256 elements. What I want to do is:

struct node** children;

and then when adding a node, just malloc space for the node and have

children[(unsigned char)'c']

point to the new node. The issue is that if I don't malloc space for children first, then I obviously can't reference any index or else that's a big error.

So my question is: how do I implement a trie such that it only stores the non-null pointers to its children?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

溺孤伤于心 2024-11-23 10:48:53

您可以尝试使用 de la Briandais trie,其中每个节点只有一个子指针,并且每个节点还有一个指向“兄弟节点”的指针,因此所有兄弟节点都有效地存储为链表,而不是直接由父节点指向。

You could try using a de la Briandais trie, where you only have one child pointer for each node, and every node also has a pointer to a "sibling", so that all siblings are effectively stored as a linked list rather than directly pointed to by the parent.

拥抱没勇气 2024-11-23 10:48:53

您实际上不可能同时拥有这两种方法,既能节省空间,又能在子节点中进行 O(1) 查找。

当您只为实际添加的条目而不是空指针分配空间时,您不能再这样做

children[(unsigned char)'c']

,因为您不能再直接索引到数组中。

一种替代方法是简单地对子级进行线性搜索。并存储 children 数组有多少条目的附加计数,即

children[(unsigned char)'c'] = ...;

必须成为

for(i = 0; i < len; i++) {
  if(children[i] == 'c')
     break;
} 
if(i == len) {
  //...reallocate and add space for one item in children
}
children[i] = ...;

如果您的树最终在某一层有很多非空条目,您可以按排序顺序插入子条目,并且进行二分搜索。或者您可以将子级添加为链接列表而不是数组。

You can't really have it both ways and be both space efficient and have O(1) lookup in the children nodes.

When you only allocate space for the entries that's actually added, and not the null pointers, you can no longer do

children[(unsigned char)'c']

As you can no longer index directly into the array.

One alternative is to simply do a linear search through the children. and store an additional count of how many entries the children array has i.e.

children[(unsigned char)'c'] = ...;

Have to become

for(i = 0; i < len; i++) {
  if(children[i] == 'c')
     break;
} 
if(i == len) {
  //...reallocate and add space for one item in children
}
children[i] = ...;

If your tree ends up with a lot of non-empty entries at one level, you might insert the children in sorted order and do a binary search. Or you might add the childrens as a linked list instead of an array.

你是暖光i 2024-11-23 10:48:53

如果你只想进行英文关键词搜索,我认为你可以最小化你的孩子的大小,从 256 到仅仅 26 - 足以覆盖 26 个字母 az。

此外,您可以使用链表来保持子级的数量更小,这样我们就可以进行更有效的迭代。

我还没有浏览过这些库,但我认为 trie 实现 会有所帮助。

If you just want to do an English keyword search, I think you can minimize the size of your children, from 256 to just 26 - just enough to cover the 26 letters a-z.

Furthermore, you can use a linked list to keep the number of children even smaller so we can have more efficient iteration.

I haven't gone through the libraries yet but I think trie implementation will help.

尴尬癌患者 2024-11-23 10:48:53

通过将每个节点的子节点设置为节点的哈希表,您可以既节省空间又保持恒定的查找时间。特别是当涉及到 Unicode 字符并且字典中可以包含的字符集不限于 52 个以上时,这更多的是一个要求而不是一个细节。这样您就可以保留使用 trie 的优点,同时提高时间和空间效率。

我还必须补充一点,如果您使用的字符集接近无界,那么有一个节点链接列表可能就可以了。如果您喜欢难以管理的噩梦,您可以选择一种混合方法,其中前几个级别将其子级保留在哈希表中,而较低级别则有一个它们的链接列表。对于真正的错误农场,请选择动态的错误农场,其中当每个链接列表通过阈值时,您可以将其动态转换为哈希表。您可以轻松地摊销成本。

可能性是无限的!

You can be both space efficient and keep the constant lookup time by making child nodes of every node a hash table of nodes. Especially when Unicode characters are involved and the set of characters you can have in your dictionary is not limited to 52 + some, this becomes more of a requirement than a nicety. This way you can keep the advantages of using a trie and be time and space efficient at the same time.

I must also add that if the character set you are using is approaching unbounded, chances are having a linked list of nodes may just do fine. If you like an unmanageable nightmare, you can opt for a hybrid approach where first few levels keep their children in hash tables while the lower levels have a linked list of them. For a true bug farm, opt for a dynamic one where as each linked list passes a threshold, you convert it to a hash table on the fly. You could easily amortize the cost.

Possibilities are endless!

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文