How to recognize a set of keywords in a text

Posted on 2024-11-08 06:28:41

I have a huge set of keywords. Given a text, I want to be able to recognize only those words that occur in the keyword list and ignore all the other words. What is the best way to approach this?

Comments (3)

半步萧音过轻尘 2024-11-15 06:28:41

The Aho-Corasick algorithm is a fast algorithm for recognizing a set of pattern strings in a larger source string. It's employed by several search utilities, along with many antivirus programs, since it runs in time O(m + n + z), where n is the total size of all the pattern strings you're trying to match, m is the length of the string to search, and z is the total number of matches. Moreover, if you know in advance what strings you're searching for, you can do the O(n) work offline and reduce the search time to O(m + z).
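
As a rough illustration of the idea, here is a minimal Aho-Corasick sketch in Python. The function names `build_automaton` and `find_all` and the list-of-dicts node layout are assumptions made for this example, not part of any particular library. Building the automaton is the O(n) step that can be done ahead of time; the scan is then a single pass over the text.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/fail/output tables for a list of pattern strings."""
    goto = [{}]   # goto[node][char] -> child node index
    fail = [0]    # fail[node] -> node for the longest proper suffix that is a trie prefix
    out = [[]]    # out[node] -> patterns that end at this node

    # Phase 1: insert every pattern into a trie.
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)

    # Phase 2: breadth-first pass to fill in failure links.
    queue = deque(goto[0].values())  # depth-1 nodes keep fail = 0
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the matches reachable through the failure link.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

Note that this reports patterns anywhere in the text, including inside longer words. If only whole words should count, one option is to check that the characters on either side of each match are not word characters.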

月棠 2024-11-15 06:28:41

Store your words in a trie.

Walk your text. Every time you start a word, start walking the trie. If you end the word at the end of a word in the trie, that is a word you were interested in. Otherwise it wasn't.

You will have minor complications around the definition of a word. In particular non-word characters usually end a word, but there are exceptions such as don't.
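
The trie walk described above might look roughly like the following Python sketch. The names `build_trie`, `find_keywords_with_trie`, the `END` sentinel, and the word pattern (letters with optional internal apostrophes, so that don't stays one word) are all assumptions chosen for illustration; a real definition of a word depends on your text.

```python
import re

END = "$end"  # hypothetical sentinel key marking "a keyword ends here"

def build_trie(words):
    """Store each word as a path of nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

# Assumed word definition: runs of letters, allowing internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def find_keywords_with_trie(text, trie):
    """Return (start_index, word) for each word of text stored in the trie."""
    found = []
    for m in WORD_RE.finditer(text):
        node = trie
        for ch in m.group():
            node = node.get(ch)
            if node is None:      # fell off the trie: not a keyword
                break
        else:
            if END in node:       # word ended exactly at a keyword end
                found.append((m.start(), m.group()))
    return found

trie = build_trie(["foo", "bar", "don't"])
print(find_keywords_with_trie("foo food don't bar", trie))
```

The sketch is case-sensitive; lowercasing both the keywords and the text before matching is one way to relax that.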

Note that some regular expression engines (Perl's, in any recent version, for one) are smart enough to automatically construct a trie and match against it. So there is a good chance that you can just join your words together with pipes, hand the result to a regular expression engine, and get good performance.

If that does not work, you can construct a regular expression that encodes a trie. For instance, given the list foo, bar, baz, blat, the regular expression /\b(foo|b(?:a(?:r|z)|lat))\b/ should match those words and only those words. It probably won't be as efficient as hand-rolled C (for instance, on Perl's engine you'll run into checks for slow-performing complex regular expressions, and it will likely do some silly backtracking that it didn't need to do), but it will be a lot less work to put together.
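
A trie-encoding regular expression like the one above can also be generated from the word list. The following Python sketch shows one possible way; `trie_regex` is a hypothetical helper written for this example, it assumes a non-empty word list, and it uses a non-capturing outer group and sorted alternatives, so the generated pattern is equivalent to the one above but not character-for-character identical.

```python
import re

def trie_regex(words):
    """Build a regex string that matches exactly the given words as whole words."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks the end of a word

    def to_pattern(node):
        alternatives = []
        optional = False  # True if a word may end at this node
        for ch in sorted(node):
            if ch == "":
                optional = True
            else:
                alternatives.append(re.escape(ch) + to_pattern(node[ch]))
        if not alternatives:
            return ""
        if len(alternatives) == 1 and not optional:
            return alternatives[0]
        body = "(?:" + "|".join(alternatives) + ")"
        return body + "?" if optional else body

    return r"\b" + to_pattern(trie) + r"\b"

pattern = trie_regex(["foo", "bar", "baz", "blat"])
print(pattern)
print(re.findall(pattern, "foo bark baz blat food"))
```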

爱你是孤单的心事 2024-11-15 06:28:41
  1. Put your keywords into a data structure that allows easy lookup. For example, a hash table or binary tree. If you're hardcore, you can create a perfect hash from your keywords.
  2. Use a DFA to break the input into "words". This can be done with a regular expression library or a simple state machine.
  3. Look up each "word" to see if it's one of your keywords.
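
The three steps above can be sketched in a few lines of Python, using a built-in `set` as the hash table and a regular expression as the tokenizer. The function name `filter_keywords` and the word pattern are assumptions for this example; adjust the pattern to whatever counts as a word in your input.

```python
import re

# Assumed word definition: runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def filter_keywords(text, keywords):
    """Return the words of text, in order, that appear in keywords."""
    keyword_set = set(keywords)          # step 1: hash-based lookup structure
    words = WORD_RE.findall(text)        # step 2: split the input into "words"
    return [w for w in words if w in keyword_set]  # step 3: look each one up

print(filter_keywords("The quick brown fox", ["quick", "fox", "dog"]))
```

Each lookup is expected constant time, so the whole pass is roughly linear in the length of the text regardless of how many keywords there are.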