从大量文本中提取数千个简单模式的快速算法

发布于 2024-12-23 11:55:10 字数 245 浏览 7 评论 0 原文

我希望能够从 GB 的文本中有效地匹配数千个正则表达式，因为我知道这些正则表达式中的大多数都相当简单，例如：

\bBarack\s(Hussein\s)?Obama\b
\b(John|J\.)\sBoehner\b

等。

我当前的想法是尝试从每个正则表达式中提取某种最长的子字符串，然后使用 Aho-Corasick 匹配这些子字符串并消除大部分正则表达式，然后匹配所有剩余的正则表达式组合。有人能想出更好的办法吗？

原文

I want to be able to match efficiently thousands of regexps out of GBs of text knowing that most of these regexps will be fairly simple, like:

\bBarack\s(Hussein\s)?Obama\b
\b(John|J\.)\sBoehner\b

etc.

My current idea is to try to extract out of each regexp some kind of longest substring, then use Aho-Corasick to match these substrings and eliminate most of the regexp and then match all the remaining regexp combined. Can anyone think of something better?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

飘然心甜 2024-12-30 11:55:10

您可以使用 (f)lex 生成 DFA，它可以并行识别所有文字。如果存在太多通配符，这可能会变得棘手，但它适用于最多大约 100 个文字（对于 4 个字母的字母表；对于自然文本可能更多）。您可能想要抑制默认操作 (ECHO)，并仅打印匹配的行号和列号。

[我假设 grep -F 的作用大致相同]

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

%%

"TTGATTCACCAGCGCGTATTGTC" { printf("@%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again"  ); }


"AGGTATCTGCTTCAATCAGCG" { printf("@%d: %d:%s\n", yylineno, yycolumn, "WTF?!"  ); } 

... 
more lines
...

[bd-fh-su-z]+ {;}

[ \t\r\n]+ {;}

. {;}

%%

int main(void)
{
/* Call the lexer, then quit. */
yylex();
return 0;
}

可以使用 awk 或任何其他脚本语言从 txt 输入生成类似上面的脚本。

You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards present, but it works for upto about 100 literals (for a 4 letter alfabet; probably more for natural text). You may want to suppress the default action (ECHO), and only print the line+column numbers of the matches.

[ I assume grep -F does about the same ]

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

%%

"TTGATTCACCAGCGCGTATTGTC" { printf("@%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again"  ); }


"AGGTATCTGCTTCAATCAGCG" { printf("@%d: %d:%s\n", yylineno, yycolumn, "WTF?!"  ); } 

... 
more lines
...

[bd-fh-su-z]+ {;}

[ \t\r\n]+ {;}

. {;}

%%

int main(void)
{
/* Call the lexer, then quit. */
yylex();
return 0;
}

A script like the one above can be generated form txt input with awk or any other script language.

回复收藏 0 原文

情话难免假 2024-12-30 11:55:10

比在每个文件上运行每个正则表达式更聪明的实现：

For each regex:
    load regex into a regex engine
    assemble a list of regex engines
For each byte in the file:
    insert byte to every regex engine
      print results if there are matches

但我不知道有任何程序已经执行此操作 - 您必须自己编写代码。这也意味着您有内存来保持正则表达式状态，并且您没有任何邪恶的正则表达式

A slightly smarter implementation than running every regex on every file:

For each regex:
    load regex into a regex engine
    assemble a list of regex engines
For each byte in the file:
    insert byte to every regex engine
      print results if there are matches

But I don't know of any programs that do this already - you'd have to code it yourself. This also implies you have the ram to keep the regex state around, and that you don't have any evil regexes

回复收藏 0 原文

抹茶夏天i‖ 2024-12-30 11:55:10

我不确定你是否会打破一些正则表达式的大小限制，但你可以将它们全部组合成一个巨大的正则表达式：

((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc))

如果你达到了某个限制，你可以一次使用 100 个块或任意数量的块来完成此操作可以管理

I'm not sure if you'd blow some regex size limit, but you could just OR them all up together into one giant regex:

((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc))

If you hit some limit, you could do this with chunks of 100 at a time or however many you can manage

回复收藏 0 原文

本宫微胖 2024-12-30 11:55:10

如果您需要快速实现某些特定情况，您可以实现后缀树 “http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm” rel="nofollow">Aho–Corasick 算法自己。但在大多数情况下，按照前面的建议，将所有正则表达式合并为单个正则表达式也不错