Aho-Corasick 和真子串
我试图理解 aho-corasick 字符串匹配算法。假设我们的模式是 abcd
和 bc
。我们最终得到一棵像这样的树
[]
/\
[a]..[b]
/ : |
[b].: [c]
| :
[c].....
|
[d]
虚线显示了失效函数。
现在假设我们输入字符串abcd
。这将沿着树并检测匹配“abcd”,但是,据我所知,匹配 bc
将不会被报告。我是否误解了算法?
I'm trying to understand the aho-corasick string match algorithm. Suppose that our patterns are abcd
and bc
. We end up a tree like this
[]
/\
[a]..[b]
/ : |
[b].: [c]
| :
[c].....
|
[d]
The dotted line show the failure function.
Now supposed that we feed in the string abcd
. This will follow the tree and detect the match "abcd," however, as far as I can tell the match bc
will not be reported. Am I misunderstanding the algorithm?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
Artem的答案是正确的,但也许不是很清楚。基本上,您需要做的是:每次到达新节点时,检查从该节点开始并由故障链接组成的整个路径,并在该路径上查找匹配项。 (这不会改变您当前的位置)。在您的示例中,您将检查路径 b->b(未找到匹配项)和 c->c(找到匹配项
bc
)。请注意,出于效率原因,您需要在每个节点缓存“下一个匹配”的值。也就是说,如果您检查从节点
u
开始的路径,并在经过几个步骤后找到匹配的节点v
,请记住值next_match[u] = v 以便下次当您必须检查此路径时,您可以直接跳转到
v
。Artem's answer is correct, but perhaps not very clear. Basically what you need to do is: every time you arrive at a new node, examine the whole path starting from this node and consisting of failure links and find matches on this path. (This doesn't change your current position). In your example you would examine the paths b->b (no matches found) and c->c (the match
bc
is found).Note that for efficiency reasons you need to cache the value of 'next match' at each node. That is, if you examine the path starting at node
u
and find a matching nodev
after several steps, remember the valuenext_match[u] = v
so that next time when you have to examine this path, you could make a single jump straight tov
.如果该节点中有字符串结尾,则必须将节点标记为really_final。在您的示例中,此类节点是“abcd”和“bc”。
之后,您必须计算节点的最终状态:如果节点是 real_final 则节点是最终的,或者失败函数的节点是最终的。因此,“abcd”、“bc”和“abc”将是最终的。
换句话说,如果某个模式在这里结束,或者可以从当前节点通过故障链接到达某个最终节点,则该节点是最终节点。
You have to mark nodes as really_final if there is a string ended in this node. In your example such nodes are "abcd" and "bc".
After it you have to calc final states for nodes: node is final if it's really_final or node by failure function is final. So, "abcd", "bc" and "abc" will be final.
In other words - node is final if some pattern is ended here or some final node is reacheable from current node walking by failure links.
设置 AhoCorasick 树的一部分是设置指针,告诉您如果下一个字符不匹配,则应转到树中的何处。例如,如果按照绘制的树中的序列abcq,则需要从abc位置跳转到bc位置,看看bc下面是否有aq。您可以使用此传递来设置另一组指针,以告诉它在发出 abcd 上的匹配信号后发出 bcd 上的匹配信号,依此类推。
在编写它时,我在 http://www.cs.helsinki 找到了 sgrep 的源代码.fi/~jjaakkol/sgrep.html 非常有帮助。据我所知, sgrep 所做的不仅仅是 AhoCorsick,而且还有 AhoCorsick 实现作为其中的一部分。
Part of setting up the AhoCorasick tree is setting up the pointers that tell you where to go in the tree if the next character is not a match. E.g. if you follow the sequence abcq in the tree as you have drawn it, you need to jump from the abc position to the bc position to see if there is a q below bc. You can use this pass to set up another set of pointers to tell it to signal a match on bcd after signalling a match on abcd and so on.
In writing it, I found the source of sgrep at http://www.cs.helsinki.fi/~jjaakkol/sgrep.html very helpful. As far as I can remember, sgrep does a LOT more than just AhoCorasick, but has an AhoCorsick implementation as part of it.