How to recognize a set of keywords in a text
I have a huge set of keywords. Given a text, I want to recognize only those words that occur in the keyword list and ignore all other words. What is the best way to approach this?
Comments (3)
The Aho-Corasick algorithm is a fast algorithm for finding a set of pattern strings in a larger source string. It is used by several search utilities and by many antivirus programs, since it runs in time O(m + n + z), where n is the total length of all the pattern strings you're trying to match, m is the length of the string to search, and z is the total number of matches. Moreover, if you know the patterns in advance, you can do the O(n) preprocessing offline and reduce the search time to O(m + z).
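To make the two phases concrete, here is a minimal Python sketch of the idea, not a tuned implementation. The names `build_automaton` and `find_all` are made up for this illustration. Note that it reports substring matches, so for whole-word keyword matching you would presumably add a word-boundary check on each hit.

```python
from collections import deque

def build_automaton(patterns):
    """Preprocessing step, O(n) in the total pattern length (can be done offline)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link of each node
    out = [[]]    # patterns that end at each node
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pat)
    # Breadth-first pass: shallower nodes get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    """Search step: one pass over the text, returns (start_index, pattern) pairs."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

The automaton depends only on the patterns, so it can be built once and reused for every text you search.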
Store your words in a trie.
Walk your text. Every time you start a word, start walking the trie. If you end the word at the end of a word in the trie, that is a word you were interested in. Otherwise it wasn't.
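A rough Python sketch of that walk might look like the following. The nested-dict trie, the `END` marker, and the `WORD` tokenizer regex are all assumptions made for this example. In particular, the tokenizer treats an apostrophe between letters as part of the word, which is only one possible definition of a word.

```python
import re

END = "$end"  # hypothetical end-of-word marker; cannot collide with a one-character key

def build_trie(words):
    """Store each word as a path of nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def in_trie(trie, word):
    """Walk the trie along `word`; it is a keyword only if the walk ends on a marker."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

# Assumed word definition: letters, optionally joined by apostrophes.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def find_keywords(text, trie):
    """Return the words of `text`, in order, that are present in the trie."""
    return [m.group() for m in WORD.finditer(text) if in_trie(trie, m.group())]

trie = build_trie(["foo", "bar", "don't"])
print(find_keywords("foo food don't bar, baz", trie))
```

In Python a plain `set` of keywords would do the same membership test with less code. The trie is shown because it is what the answer describes, and because it lets you abandon a word as soon as its prefix leaves the trie.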
You will have minor complications around the definition of a word. In particular, non-word characters usually end a word, but there are exceptions such as "don't".
Note that some regular expression engines (Perl's, in any recent version of Perl, for one) are smart enough to automatically construct a trie and try to match it. Therefore there is a good chance that you can just join your words together with pipes, throw the result at a regular expression engine, and get good performance.
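For illustration, here is what joining the words with pipes could look like in Python. This only shows the shape of the approach: Python's `re` module is a backtracking engine and, as far as I know, does not apply the trie optimization described for Perl, so performance on a huge list may differ. The helper name `keyword_pattern` is made up for this sketch.

```python
import re

def keyword_pattern(words):
    """Build one alternation that matches any of `words` as a whole word."""
    # Longest first, so a short word such as "don" cannot win over "don't".
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

pattern = keyword_pattern(["foo", "bar", "baz", "blat"])
print(pattern.findall("foo food bar blat blats baz"))
```

`re.escape` is used so that keywords containing regex metacharacters are matched literally.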
If that does not work, you can construct a regular expression that encodes a trie. For instance, given the list foo, bar, baz, blat, the regular expression /\b(foo|b(?:a(?:r|z)|lat))\b/ should match those words and only those words. It probably won't do it as efficiently as hand-rolled C (for instance, on Perl's engine you'll be encountering checks for slow-performing complex regular expressions, and it will likely do some silly backtracking that it didn't need to do), but it will be a lot less work to put together.
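A trie-encoding regular expression like this can be generated from the word list rather than written by hand. The following Python sketch is one way to do it. The function name `trie_regex` and the use of an empty-string key as the end-of-word marker are choices made for this example, and the output uses a non-capturing outer group and alphabetical branch order, so it is equivalent to the expression above but not character-for-character identical.

```python
import re

def trie_regex(words):
    """Return a regex source string that encodes `words` as a trie."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks "a word ends here"

    def render(node):
        ends_here = "" in node
        branches = [re.escape(ch) + render(child)
                    for ch, child in sorted(node.items()) if ch != ""]
        if not branches:
            return ""
        if len(branches) == 1 and not ends_here:
            return branches[0]  # single path, no group needed
        group = "(?:" + "|".join(branches) + ")"
        # If a word ends at this node, anything further is optional.
        return group + "?" if ends_here else group

    return r"\b" + render(trie) + r"\b"

source = trie_regex(["foo", "bar", "baz", "blat"])
print(source)
print(re.findall(source, "foo food bar blat blats baz"))
```

Because shared prefixes are factored out, the engine never has to re-read the same prefix for each alternative, which is the point of encoding the trie in the pattern.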