基数树/patricia trie 中的前缀搜索

发布于 2024-07-18 06:31:54 字数 987 浏览 11 评论 0原文

我目前正在实现一个基数树/帕特里夏特里(无论你想怎么称呼它)。 我想用它在功能严重不足的硬件上的字典中进行前缀搜索。 它的工作原理或多或少类似于自动完成,即显示与键入的前缀匹配的单词列表。

我的实现基于本文,但其中的代码不包括前缀搜索,尽管作者说:

[...] 假设您想要枚举所有具有公共前缀“AB”的键的节点。 您可以从该根开始执行深度优先搜索,只要遇到后边缘就停止。

但我不明白这应该如何运作。 例如,如果我从这些单词构建一个基数树:

生病
想象的
想象力
想象一下
仿制品
立即
立即
巨大

我将为前缀“i”和“in”获得完全相同的“最佳匹配”,因此对我来说,仅通过从最佳匹配遍历树来收集所有匹配的单词似乎很困难。

此外,还有一个 Java 中的基数树实现,它在 RadixTreeImpl.java。 该代码显式检查所有节点(从某个节点开始)是否有前缀匹配 - 它实际上比较字节。

谁能指出我在基数树上实现前缀搜索的详细描述? Java实现中使用的算法是唯一的方法吗?

I'm currently implementing a radix tree/patricia trie (whatever you want to call it). I want to use it for prefix searches in a dictionary on a severely underpowered piece of hardware. It's supposed to work more or less like auto-completion, i. e. showing a list of words that the typed prefix matches.

My implementation is based on this article, but the code therein doesn't include prefix searches, though the author says:

[...] Say you want to enumerate all the nodes that have keys with a common prefix "AB". You can perform a depth first search starting at that root, stopping whenever you encounter back edges.

But I don't see how that is supposed to work. For example, if I build a radix tree from these words:

illness
imaginary
imagination
imagine
imitation
immediate
immediately
immense
in

I will get the exact same "best match" for the prefixes "i" and "in" so that it seems difficult to me to gather all matching words just by traversing the tree from that best match.

Additionally, there is a radix tree implementation in Java that has an implemented prefix search in RadixTreeImpl.java. That code explicitly checks all nodes (starting from a certain node) for a prefix match - it actually compares bytes.

Can anyone point me to a detailed description on implementing a prefix search on radix trees? Is the algorithm used in the Java implementation the only way to do it?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

电影里的梦 2024-07-25 06:31:54

想想你的 trie 编码了什么。 在每个节点,您都有通往该节点的路径,因此在您的示例中,您从对应于空字符串的根节点 Λ(这是一个大写的 Lambda,这种希腊字体有点糟糕)开始。 Λ 对于所使用的每个字母都有子代,因此在您的数据集中,您有一个分支,即“i”。

  • Λ
  • Λ→"i"

在“i”节点,有两个子节点,一个代表“m”,一个代表“n”。 下一个字母是“n”,所以你可以认为,

  • Λ→“i”→“n”

,并且由于数据集中唯一以“i”、“n”开头的单词“in ”,没有来自“n”的孩子。 那是一场比赛。

现在,假设数据集不是“in”,而是“infindibulum”。 (我引用的 SF 留作练习。)您仍然会以相同的方式到达“n”节点,但是如果您得到的下一个字母是“q”,您就知道该单词不会出现在你的数据集中,因为没有“q”分支。 那时,你会说“好吧,不匹配”。 (也许您然后开始添加单词,也许不添加,具体取决于应用程序。)

但如果下一个字母是“f”,您可以继续。 不过,您可以通过一些小技巧将其短路:一旦到达代表唯一路径的节点,您就可以将整个字符串挂在该节点上。 当您到达该节点时,您知道字符串的其余部分必须是“findibulum”,因此您使用了前缀来匹配整个字符串并返回它。

你怎么用它? 在许多非 UNIX 命令解释器中,例如旧的 VAX DCL,您可以使用命令的任何唯一前缀。 因此,ls(1) 的等价物是 DIRECTORY,但没有其他命令以 DIR 开头,因此您可以输入 DIR ,结果如下就像做整个词一样好。 如果你记不住正确的命令,你可以只输入“D”,然后按(我认为)ESC; DCL CLI 会返回所有D 开头的命令,它可以非常快速地搜索这些命令。

Think about what your trie encodes. At each node, you have the path that lead you to that node, so in your example, you start at Λ (that's a capital Lambda, this greek font kind of sucks) the root node corresponding to an empty string. Λ has children for each letter used, so in your data set, you have one branch, for "i".

  • Λ
  • Λ→"i"

At the "i" node, there are two children, one for "m" and one for "n". The next letter is "n", so you take that,

  • Λ→"i"→"n"

and since the only word that starts "i","n" in your data set is "in", there are no children from "n". That's a match.

Now, let's say the data set, instead of having "in", had "infindibulum". (What SF I'm referencing is left as an exercise.) You'd still get to the "n" node the same way, but then if the next letter you get is "q", you know the word doesn't appear in your data set at all, because there's no "q" branch. At that point, you say "okay, no match." (Maybe you then start adding the word, maybe not, depending on the application.)

But if the next letter is "f", you can keep going. You can short circuit that with a little craft, though: once you reach a node that represents a unique path, you can hang the whole string off that node. When you get to that node, you know that the rest of the string must be "findibulum", so you've used the prefix to match the whole string, and return it.

How your you use that? in a lot of non-UNIX command interpreters, like the old VAX DCL, you could use any unique prefix of a command. So, the equivalent of ls(1) was DIRECTORY, but no other command started with DIR, so you could type DIR and that was as good as doing the whole word. If you couldn't remember the correct command, you could type just 'D', and hit (I think) ESC; the DCL CLI would return you all the commands that started with D, which it could search extremely fast.

榕城若虚 2024-07-25 06:31:54

事实证明,标准 c++ 库的 GNU 扩展包括 Patricia trie 实现。 它可以在基于策略的数据结构扩展下找到。 请参阅 http://gcc.gnu.org/onlinedocs/libstdc++/ext /pb_ds/trie_based_containers.html

It turns out the GNU extensions for the standard c++ lib includes a Patricia trie implementation. It's found under the policy-based data-structures extension. See http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/trie_based_containers.html

同展鸳鸯锦 2024-07-25 06:31:54

另一种算法:保持简单愚蠢!

只需列出您的关键字的排序列表即可。 当您有前缀时,可以使用二分搜索来查找该前缀在列表中的位置。 将从该索引开始找到所有可能的完成,并准备好就地访问。

该算法只需要 Patricia trie 代码的 5%,并且易于维护、理解和更新。 几乎可以肯定,这种简单的列表搜索也会更加高效。

唯一的缺点是,如果您有大量具有相似前缀的长关键字,则 trie 可以节省一些存储空间,因为它不需要为每个条目保留完整的前缀。 实际上,如果您的字数少于几百万,这并不能节省,因为树的指针开销将占主导地位。 对于像搜索具有数百万个字符的 DNA 字符串数据库(而不是文本关键字)这样的应用程序来说,这种节省更多。

An alternative algorithm: Keep It Simple Stupid!

Just make a sorted list of your keywords. When you have a prefix, binary search to find where that prefix would be located in the list. All of your possible completions will be found starting at that index, ready to be accessed in place.

This algorithm will will require only 5% of the code of a Patricia trie and will be easy to maintain, understand, and update. It is almost certain this simple list search will be more efficient as well.

The only downside is if you have huge numbers of long keywords with similar prefixes, a trie can save some storage since it doesn't need to keep the full prefix for every entry. In practice, if you have less than a few million words, this is not a savings because the pointer overhead of the tree will dominate. This savings is more for applications like searching databases of DNA strings with millions of characters, not text keywords.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文