Scalability of Aho-Corasick
I want to search a text document for occurrences of keyphrases from a database of keyphrases extracted from Wikipedia article titles (i.e. given a document, I want to find whether any of its phrases has a corresponding Wikipedia article). I found out about the Aho-Corasick algorithm, and I want to know whether building an Aho-Corasick automaton for a dictionary with millions of entries is efficient and scalable.
4 Answers
Let's make a simple calculation:
Assume that you have 1 million patterns (strings, phrases) with an average length of 10 characters, and a value (label, token, pointer, etc.) one word (4 bytes) long assigned to each pattern.
Then you will need an array of (10 + 4) × 1 million = 14 million bytes (14 MB) just to hold the list of patterns.
From 1 million patterns of 10 bytes (letters, chars) each, you could build an AC trie with no more than 10 million nodes. How big this trie is in practice depends on the size of each node.
Each node should keep at least 1 byte for the label (letter) and one word (4 bytes) for a pointer to the next node in the trie (or to the pattern, for a terminal node), plus 1 bit (boolean) to mark terminal nodes: about 5 bytes per node in total.
So the minimum size of a trie for 1 million patterns of 10 chars each is at least 50 million bytes, or about 50 MB of memory.
In practice it might be 3-10 times more, but that is still very manageable, as even 500 MB of memory is quite moderate today. (Compare it with Windows applications like Word or Outlook.)
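For concreteness, here is the same back-of-envelope arithmetic as a runnable snippet (it only restates the numbers above; nothing is measured):

```python
# Back-of-envelope estimate from the answer above; just the arithmetic.
NUM_PATTERNS = 1_000_000
AVG_LEN = 10          # average pattern length in characters
VALUE_SIZE = 4        # one 4-byte word of payload per pattern

pattern_list = NUM_PATTERNS * (AVG_LEN + VALUE_SIZE)
print(pattern_list / 1e6, "MB for the raw pattern list")        # 14.0 MB

max_nodes = NUM_PATTERNS * AVG_LEN     # upper bound: no shared prefixes at all
node_size = 5                          # 1-byte label + 4-byte pointer (+ terminal bit)
print(max_nodes * node_size / 1e6, "MB minimum for the trie")   # 50.0 MB
print("realistic range:", 3 * 50, "-", 10 * 50, "MB")           # 3-10x overhead
```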
Given that in terms of speed the Aho-Corasick (AC) algorithm is almost unbeatable, it still remains the best algorithm for multi-pattern matching ever devised. That's my strong personal, educated opinion, academic garbage aside.
All reports of "new" latest-and-greatest algorithms that might outperform AC are highly exaggerated (except maybe for some special cases with short patterns, such as DNA).
In practice, the only improvement to AC can come from more and faster hardware (multiple cores, faster CPUs, clusters, etc.).
Don't take my word for it; test it for yourself. But remember that the real speed of AC strongly depends on the implementation (language and quality of coding).
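If you want to test it, a minimal sketch using the pyahocorasick package might look like the following; the library choice and the synthetic phrase list are my assumptions, not something this answer specifies:

```python
import time
import ahocorasick  # pip install pyahocorasick

# Synthetic stand-in for a dictionary of Wikipedia titles.
automaton = ahocorasick.Automaton()
for idx in range(1_000_000):
    phrase = f"phrase number {idx}"
    automaton.add_word(phrase, (idx, phrase))
automaton.make_automaton()  # builds the goto/failure structure (the AC construction)

document = "a long document mentioning phrase number 42 somewhere in the middle"
start = time.perf_counter()
matches = [(end, value) for end, value in automaton.iter(document)]
elapsed_ms = (time.perf_counter() - start) * 1e3
print(len(matches), "matches in", round(elapsed_ms, 2), "ms")
```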
In theory, it should maintain linear speed, subject only to the effects of the memory hierarchy: it will slow down as it gets too big to fit in cache, and when it gets really big, you'll have problems if it starts getting paged out.
OTOH, the big win with Aho-Corasick comes when searching for decent-sized substrings that may occur at any possible location within the string being fed in. If your text document is already cut up into words, and your search phrases are no more than, say, 6 words long, then you could build a hash table of K-word phrases and look up every contiguous K-word section of the input text in it, for K = 1..6.
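A minimal sketch of that hash-table approach; the phrase set and the whitespace tokenization are illustrative assumptions:

```python
# Hash-table lookup of every contiguous K-word window, K = 1..6.
MAX_WORDS = 6
phrases = {"aho corasick", "hash table", "computer science"}  # stand-in for the title set

def find_phrases(document: str) -> list[str]:
    words = document.lower().split()
    hits = []
    for k in range(1, MAX_WORDS + 1):
        for i in range(len(words) - k + 1):
            window = " ".join(words[i:i + k])
            if window in phrases:
                hits.append(window)
    return hits

print(find_phrases("Aho Corasick is a classic of computer science"))
# -> ['aho corasick', 'computer science']
```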
(Answer to comment)
Aho-Corasick needs to live in memory, because you will be following pointers all over the place. If you have to work outside memory, it's probably easiest to go back to old-fashioned sort/merge: create a file of K-word records from the input data, where K is the maximum number of words in any phrase you are interested in, sort it, and then merge it against a file of sorted Wikipedia phrases. You can probably do this almost by hand on Unix/Linux, using utilities such as sort and join and a bit of shell/awk/perl/whatever. See also http://en.wikipedia.org/wiki/Key_Word_in_Context (I'm old enough to have actually used one of these indexes, provided as bound pages of computer printout).
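The same sort/merge idea sketched in Python rather than shell; the file names and the merge-join loop are illustrative, not from the answer:

```python
# Merge-join the document's K-word windows against a sorted phrase file.
K_MAX = 6

def windows(words, k_max):
    for k in range(1, k_max + 1):
        for i in range(len(words) - k + 1):
            yield " ".join(words[i:i + k])

with open("document.txt") as f:                    # hypothetical input file
    grams = sorted(set(windows(f.read().lower().split(), K_MAX)))

matches = []
with open("phrases.sorted.txt") as f:              # pre-sorted phrases, one per line
    gram_iter = iter(grams)
    gram = next(gram_iter, None)
    for line in f:
        phrase = line.rstrip("\n")
        while gram is not None and gram < phrase:  # advance the smaller side
            gram = next(gram_iter, None)
        if gram == phrase:
            matches.append(phrase)

print(matches)
```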
Well, there is a workaround: write the built AC trie of the dictionary into a text file in an XML-like format, make an index file for the first 6 levels of that trie, and so on. In my tests, searching for all partial matches of a sentence in the dictionary (500,000 entries), I get ~150 ms for ~100 results for a sentence of 150-200 symbols.
For more details, check out this paper: http://212.34.233.26/aram/IJITA17v2A.Avetisyan.doc
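A heavily simplified sketch of the on-disk idea (I have not reproduced the paper's XML-like trie format; this only illustrates keeping the dictionary on disk with a small in-memory index of the first levels):

```python
# Toy version: dictionary stays on disk, sorted; an in-memory index maps short
# prefixes to byte offsets so a lookup only seeks and scans one region.
# (The paper indexes the first 6 trie levels; 2-character prefixes keep this small.)
PREFIX = 2

entries = sorted(["aho", "ahocorasick", "tree", "trie", "wikipedia"])
index = {}
with open("dict.txt", "w") as f:
    for word in entries:
        index.setdefault(word[:PREFIX], f.tell())
        f.write(word + "\n")

def lookup(word: str) -> bool:
    offset = index.get(word[:PREFIX])
    if offset is None:
        return False
    with open("dict.txt") as f:
        f.seek(offset)
        for line in f:
            entry = line.rstrip("\n")
            if entry[:PREFIX] != word[:PREFIX]:
                break               # left this prefix's region; stop scanning
            if entry == word:
                return True
    return False

print(lookup("trie"), lookup("banana"))   # True False
```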
There are other ways to get performance:
- condense state transitions: you can get them down to 32 bits.
- ditch the pointers; write the state transitions to a flat vector (see the sketch below).
- pack nodes near the tree root together: they will be in cache.
The implementation takes about 3 bytes per char of the original pattern set, and with 32-bit nodes it can handle a pattern space of about 10M chars.
For 64-bit nodes, I have yet to hit (or work out) the limit.
Doc: https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/view
Src: https://github.com/mischasan/aho-corasick
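As a rough illustration of the flat-vector idea from the list above (the 256-slot-per-state layout is invented for this sketch and is not this implementation's actual node format):

```python
# Toy flat-vector trie: states are rows in one flat array of 32-bit slots,
# not pointer-linked node objects. Each state has 256 slots (one per input
# byte), each holding the next state's index, 0 meaning "no transition".
# Failure links are omitted; this shows only the flat storage idea, not a
# complete Aho-Corasick automaton.
from array import array

ALPHABET = 256

def build(patterns):
    transitions = array("I", [0] * ALPHABET)   # state 0 = root
    for pat in patterns:
        state = 0
        for byte in pat.encode():
            slot = state * ALPHABET + byte
            if transitions[slot] == 0:         # allocate a new state row
                next_state = len(transitions) // ALPHABET
                transitions.extend([0] * ALPHABET)
                transitions[slot] = next_state
            state = transitions[slot]
    return transitions

transitions = build(["he", "she", "his"])
# States allocated near the root sit at the front of the array, so they
# share cache lines, which is the point of packing them together.
print(len(transitions) // ALPHABET, "states,",
      transitions.itemsize * len(transitions), "bytes")
```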