Aho-Corasick 和真子串

发布于 2024-10-24 11:45:53 字数 330 浏览 8 评论 0原文

我试图理解 aho-corasick 字符串匹配算法。假设我们的模式是 abcdbc。我们最终得到一棵像这样的树

       []
       /\
    [a]..[b]
    /  :  |
   [b].: [c]
    |     :
   [c].....
    |
   [d]

虚线显示了失效函数。

现在假设我们输入字符串abcd。这将沿着树并检测匹配“abcd”,但是,据我所知,匹配 bc 将不会被报告。我是否误解了算法?

I'm trying to understand the aho-corasick string match algorithm. Suppose that our patterns are abcd and bc. We end up a tree like this

       []
       /\
    [a]..[b]
    /  :  |
   [b].: [c]
    |     :
   [c].....
    |
   [d]

The dotted line show the failure function.

Now supposed that we feed in the string abcd. This will follow the tree and detect the match "abcd," however, as far as I can tell the match bc will not be reported. Am I misunderstanding the algorithm?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

烙印 2024-10-31 11:45:53

Artem的答案是正确的,但也许不是很清楚。基本上,您需要做的是:每次到达新节点时,检查从该节点开始并由故障链接组成的整个路径,并在该路径上查找匹配项。 (这不会改变您当前的位置)。在您的示例中,您将检查路径 b->b(未找到匹配项)和 c->c(找到匹配项 bc)。

请注意,出于效率原因,您需要在每个节点缓存“下一个匹配”的值。也就是说,如果您检查从节点 u 开始的路径,并在经过几个步骤后找到匹配的节点 v,请记住值 next_match[u] = v 以便下次当您必须检查此路径时,您可以直接跳转到 v

Artem's answer is correct, but perhaps not very clear. Basically what you need to do is: every time you arrive at a new node, examine the whole path starting from this node and consisting of failure links and find matches on this path. (This doesn't change your current position). In your example you would examine the paths b->b (no matches found) and c->c (the match bc is found).

Note that for efficiency reasons you need to cache the value of 'next match' at each node. That is, if you examine the path starting at node u and find a matching node v after several steps, remember the value next_match[u] = v so that next time when you have to examine this path, you could make a single jump straight to v.

叹梦 2024-10-31 11:45:53

如果该节点中有字符串结尾,则必须将节点标记为really_final。在您的示例中,此类节点是“abcd”和“bc”。
之后,您必须计算节点的最终状态:如果节点是 real_final 则节点是最终的,或者失败函数的节点是最终的。因此,“abcd”、“bc”和“abc”将是最终的。

换句话说,如果某个模式在这里结束,或者可以从当前节点通过故障链接到达某个最终节点,则该节点是最终节点。

You have to mark nodes as really_final if there is a string ended in this node. In your example such nodes are "abcd" and "bc".
After it you have to calc final states for nodes: node is final if it's really_final or node by failure function is final. So, "abcd", "bc" and "abc" will be final.

In other words - node is final if some pattern is ended here or some final node is reacheable from current node walking by failure links.

百善笑为先 2024-10-31 11:45:53

设置 AhoCorasick 树的一部分是设置指针,告诉您如果下一个字符不匹配,则应转到树中的何处。例如,如果按照绘制的树中的序列abcq,则需要从abc位置跳转到bc位置,看看bc下面是否有aq。您可以使用此传递来设置另一组指针,以告诉它在发出 abcd 上的匹配信号后发出 bcd 上的匹配信号,依此类推。

在编写它时,我在 http://www.cs.helsinki 找到了 sgrep 的源代码.fi/~jjaakkol/sgrep.html 非常有帮助。据我所知, sgrep 所做的不仅仅是 AhoCorsick,而且还有 AhoCorsick 实现作为其中的一部分。

Part of setting up the AhoCorasick tree is setting up the pointers that tell you where to go in the tree if the next character is not a match. E.g. if you follow the sequence abcq in the tree as you have drawn it, you need to jump from the abc position to the bc position to see if there is a q below bc. You can use this pass to set up another set of pointers to tell it to signal a match on bcd after signalling a match on abcd and so on.

In writing it, I found the source of sgrep at http://www.cs.helsinki.fi/~jjaakkol/sgrep.html very helpful. As far as I can remember, sgrep does a LOT more than just AhoCorasick, but has an AhoCorsick implementation as part of it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文