针对多个正则表达式高效查询一个字符串
假设我有 10,000 个正则表达式和一个字符串,我想查明该字符串是否与其中任何一个匹配,并获取所有匹配项。 最简单的方法是针对所有正则表达式一一查询字符串。 有没有更快、更有效的方法来做到这一点?
编辑: 我尝试用 DFA 的 (lex) 替换它 这里的问题是它只会给你一种模式。 如果我有字符串“hello”和模式“[H|h]ello”和“.{0,20}ello”,DFA 将仅匹配其中之一,但我希望它们都匹配。
Lets say that I have 10,000 regexes and one string and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster,more efficient way to do it?
EDIT:
I have tried substituting it with DFA's (lex)
The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(18)
这就是词法分析器的工作方式。
正则表达式被转换为单个非确定性自动机 (NFA) 并可能转换为确定性自动机 (DFA)。
生成的自动机将尝试一次匹配所有正则表达式,并将在其中一个上成功。
有许多工具可以在这里为您提供帮助,它们被称为“词法分析器生成器”,并且有适用于大多数语言的解决方案。
你没有说你使用哪种语言。 对于 C 程序员,我建议看看 re2c 工具。 当然,传统的 (f)lex 始终是一种选择。
This is the way lexers work.
The regular expressions are converted into a single non deterministic automata (NFA) and possibily transformed in a deterministic automata (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here, they are called "lexer generator" and there are solutions that work with most of the languages.
You don't say which language are you using. For C programmers I would suggest to have a look at the re2c tool. Of course the traditional (f)lex is always an option.
我过去也遇到过类似的问题。 我使用了类似于 akdom 建议的解决方案。
我很幸运,因为我的正则表达式通常有一些子字符串必须出现在它匹配的每个字符串中。 我能够使用简单的解析器提取这些子字符串,并使用 Aho-Corasick 算法在 FSA 中对它们进行索引。 然后使用索引快速消除所有与给定字符串不匹配的正则表达式,只留下少数正则表达式需要检查。
我在 LGPL 下将代码作为 Python/C 模块发布。 请参阅 Google 代码托管上的 esmre。
I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string it matches. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithms. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
我们必须在我曾经开发过的产品上执行此操作。 答案是将所有正则表达式一起编译成一个确定性有限状态机(也称为确定性有限状态机)有限自动机或DFA)。 然后,DFA 可以逐个字符地遍历字符串,并在其中一个表达式匹配时触发“匹配”事件。
优点是它运行速度快(每个字符只比较一次),并且如果添加更多表达式也不会变慢。
缺点是自动机需要巨大数据表,并且不支持许多类型的正则表达式(例如反向引用)。
我们使用的那个是由当时我们公司的 C++ 模板坚果手工编码的,所以不幸的是我没有任何 FOSS 解决方案可以向您指出。 但是,如果您使用“DFA”谷歌搜索正则表达式或正则表达式,您会发现一些可以为您指明正确方向的内容。
We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
Martin Sulzmann 在这个领域做了相当多的工作。
他有一个 HackageDB 项目 简要解释此处使用偏导数似乎是为此量身定做的。
使用的语言是 Haskell,因此如果需要的话,很难翻译成非函数式语言(我认为翻译成许多其他 FP 语言仍然很困难)。
该代码不是基于转换为一系列自动机然后将它们组合起来,而是基于正则表达式本身的符号操作。
此外,该代码是非常实验性的,马丁不再是教授,而是从事“有酬工作”(1),因此可能不感兴趣/无法提供任何帮助或意见。
Martin Sulzmann Has done quite a bit of work in this field.
He has a HackageDB project explained breifly here which use partial derivatives seems to be tailor made for this.
The language used is Haskell and thus will be very hard to translate to a non functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.
Also the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1) so may be uninterested/unable to supply any help or input.
10,000 个正则表达式是吗? Eric Wendelin关于层次结构的建议似乎是个好主意。 您是否想过将这些正则表达式的庞大程度减少到类似树结构的程度?
举一个简单的例子:所有需要数字的正则表达式都可以从一个正则表达式进行分支检查,所有正则表达式不需要一个数字到另一个分支。 通过这种方式,您可以将实际比较的数量减少到沿着树的路径,而不是每次都进行 10,000 次比较。
这需要将提供的正则表达式分解为流派,每个流派都有一个共享测试,如果失败则将其排除。 通过这种方式,理论上可以大大减少实际比较的次数。
如果您必须在运行时执行此操作,您可以解析给定的正则表达式并将它们“归档”到预定义的流派(最容易做到)或当时生成的比较流派(不那么容易做到)。
您将“hello”与“[H|h]ello”和“.{0,20}ello”进行比较的示例对此解决方案并没有真正的帮助。 一个有用的简单情况是:如果您有 1000 个测试,只有在字符串中某处存在“ello”并且您的测试字符串是“goodbye”时才会返回 true; 你只需要对“ello”进行一次测试,并且知道需要它的 1000 次测试将不起作用,因此,你不必进行这些测试。
10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful would be: if you had 1000 tests that would only return true if "ello" exists somewhere in the string and your test string is "goodbye;" you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't work, and because of this, you won't have to do them.
如果您正在考虑“10,000 个正则表达式”,您需要改变您的思考过程。 如果不出意外,请考虑“要匹配的 10,000 个目标字符串”。 然后寻找为处理“大量目标字符串”情况而构建的非正则表达式方法,例如 Aho-Corasick 机器。 不过,坦率地说,似乎在这个过程中比使用哪台机器更早的时候就出现了问题,因为 10,000 个目标字符串听起来更像是数据库查找而不是字符串匹配。
If you're thinking in terms of "10,000 regexes" you need to shift your though processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like somethings gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.
Aho-Corasick 就是我的答案。
我有 2000 个类别的事物,每个类别都有要匹配的模式列表。 字符串长度平均约为 100,000 个字符。
主要警告:要匹配的模式都是语言模式而不是正则表达式模式,例如
'cat'
与r'\w+'
。我使用的是python,因此使用了https://pypi.python.org/ pypi/pyahocorasick/。
在我的 2000 个类别上运行
%timeit test()
,每个类别大约有 2-3 个跟踪,_string
长度约为100,000
得到了我2.09 毫秒
与执行顺序re.search()
的631 毫秒
快了 315 倍!。Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main Caveat: The patters to match were all language patters not regex patterns e.g.
'cat'
vsr'\w+'
.I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
Running
%timeit test()
on my 2000 categories with about 2-3 traces per category and a_string
length of about100,000
got me2.09 ms
vs631 ms
doing sequentialre.search()
315x faster!.您需要有某种方法来确定给定的正则表达式与另一个正则表达式相比是否是“可加的”。 创建某种正则表达式“层次结构”,允许您确定某个分支的所有正则表达式都不匹配
You'd need to have some way of determining if a given regex was "additive" compared to another one. Creating a regex "hierarchy" of sorts allowing you to determine that all regexs of a certain branch did not match
您可以将它们组合成大约 20 个的组。
只要每个正则表达式有零个(或至少相同数量)捕获组,您就可以查看捕获的内容以查看哪些模式匹配。
如果 regex1 匹配,捕获组 1 将具有其匹配的文本。 如果没有,则为
undefined
/None
/null
/...You could combine them in groups of maybe 20.
As long as each regex has zero (or at least the same number of) capture groups, you can look at what what captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have it's matched text. If not, it would be
undefined
/None
/null
/...如果您使用真正的正则表达式(与形式语言理论中的正则语言相对应的正则表达式,而不是类似 Perl 的非常规事物),那么您很幸运,因为正则语言在联合下是封闭的。 在大多数正则表达式语言中,管道 (|) 是联合。 因此,您应该能够构造一个字符串(表示您想要的正则表达式),如下所示:
其中括号用于分组,而不是匹配。 与此正则表达式匹配的任何内容都至少与您的原始正则表达式之一匹配。
If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
如果您只需要知道哪些正则表达式匹配,我建议您使用英特尔的 Hyperscan。 它就是为此目的而构建的。 如果您需要采取的操作更复杂,您也可以使用 ragel。 尽管它生成单个 DFA 并且可以产生许多状态,因此是一个非常大的可执行程序。 Hyperscan 采用混合 NFA/DFA/自定义方法进行匹配,可以很好地处理大量表达式。
I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use ragel. Although it produces a single DFA and can result in many states, and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.
我想说这是真正的解析器的工作。 中点可能是解析表达式语法 (PEG)。 它是模式匹配的更高级别的抽象,一个特性是您可以定义整个语法而不是单个模式。 有一些高性能实现可以将语法编译为字节码并在专门的虚拟机中运行。
免责声明:我唯一知道的是 LPEG,一种Lua 的库,(对我来说)掌握基本概念并不容易。
I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
disclaimer: the only one i know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.
我几乎建议编写一个“由内而外”的正则表达式引擎 - 其中“目标”是正则表达式,“术语”是字符串。
然而,迭代尝试每一个的解决方案似乎会容易得多。
I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.
您可以将正则表达式编译为混合 DFA/Bucchi 自动机,其中每次BA 进入接受状态,您标记哪个正则表达式规则“命中”。
Bucchi 对此有点大材小用,但修改 DFA 的工作方式可以解决这个问题。
You could compile the regex into a hybrid DFA/Bucchi automata where each time the BA enters an accept state you flag which regex rule "hit".
Bucchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.
我将 Ragel 与离开操作一起使用:
字符串“hello”将调用
操作中的代码hello
块,然后是action ello
块,最后是action ello2
块。它们的正则表达式非常有限,并且首选机器语言,示例中的大括号仅适用于更通用的语言。
I use Ragel with a leaving action:
The string "hello" would call the code in the
action hello
block, then in theaction ello
block and lastly in theaction ello2
block.Their regular expressions are quite limited and the machine language is preferred instead, the braces from your example only work with the more general language.
尝试将它们组合成一个大的正则表达式?
Try combining them into one big regex?
我认为简短的答案是,是的,有一种方法可以做到这一点,并且它在计算机科学中是众所周知的,但我不记得它是什么。
简而言之,您可能会发现您的正则表达式解释器在 | 在一起时已经有效地处理了所有这些,或者您可能会找到一个可以这样做的解释器。 如果没有,那么您就该去谷歌搜索字符串匹配和搜索算法了。
I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.
最快的方法似乎是这样的(代码是 C#):
哦,你的意思是最快的代码? 我不知道然后....
The fastest way to do it seems to be something like this (code is C#):
Oh, you meant the fastest code? i don't know then....