How can I find the "minimal spanning set" for a collection of regular expressions?
CONTEXT:
I have a smallish (currently fewer than 100) but growing collection of regular expressions, and I want to optimize the process of determining, for a given text string, which of the REs in my collection match it.
Some of the REs have an ordering relationship - for example, if I know that the string $t matches /windows.*2000/i then I also know that $t matches /windows/i. So when testing $t against the REs in my collection I can skip testing /windows/i if I've already tested $t against /windows.*2000/i and found a match (although if /windows.*2000/i does not match then of course I cannot skip the test against /windows/i).
Note that none of the REs in my collection are entirely equivalent (for any pair of REs there is at least one text string which matches one and does not match the other).
STRATEGY:
I want to build a directed graph G with a node for each RE in my collection and a directed edge for each pair of REs with an ordering relationship (A -> B means "match against A implies match against B"), and find a "minimal spanning set" of nodes for the graph (minimal set of nodes S such that every node in G lies on a directed path which originates in S).
THE EASY PART:
There are lots of freely available algorithms for working with Directed Acyclic Graphs. So once the graph G is built for my collection of REs (which being distinct should guarantee that G is acyclic) I don't expect to have much difficulty finding an appropriate algorithm for finding a minimal spanning set for G.
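For a DAG, the minimal spanning set described above is simply the set of nodes with no incoming edge: every other node is reachable from one of them, and none of them is reachable from anywhere else. A minimal Python sketch, assuming the implication edges are already known (the function names and the sample patterns are hypothetical, chosen only for illustration):

```python
from collections import defaultdict


def minimal_spanning_set(nodes, edges):
    """Return the nodes of a DAG that have no incoming edge.

    `edges` is a list of (a, b) pairs meaning "a match against a implies
    a match against b". The graph is assumed to be acyclic.
    """
    has_incoming = {b for _, b in edges}
    return [n for n in nodes if n not in has_incoming]


def implied_matches(start, edges):
    """Return every node reachable from `start`, including `start` itself."""
    children = defaultdict(list)
    for a, b in edges:
        children[a].append(b)
    seen, stack = {start}, [start]
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical collection; the single edge is assumed, not computed.
patterns = ["windows.*2000", "windows", "linux"]
edges = [("windows.*2000", "windows")]

roots = minimal_spanning_set(patterns, edges)
```

Here `roots` holds the REs to test first. When a root matches, everything in `implied_matches(root, edges)` matches too and can be skipped; when a root does not match, its descendants still have to be tested.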
WHERE I NEED HELP:
I'd like to find an efficient way to find all the ordering relationships between the REs in my collection - and perhaps also to ensure that no two REs in the collection are equivalent (I will need a way to automatically verify this as new REs are added).
My (essentially random) web searches have so far turned up at least one plausible claim that a reasonable way to compute what (if any) ordering relationship exists between two REs does indeed exist, but have not yet turned up any description of a complete algorithm.
Does anyone know of an existing implementation (for comparing REs) which is reasonably efficient, freely available, and (ideally) implemented either in one of the popular scripting languages or C/C++?
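One standard automata-theory way to decide the ordering relationship is language inclusion: "a match against A implies a match against B" holds exactly when no string is accepted by A's automaton and rejected by B's. The sketch below is not an existing implementation, only an illustration of that check. It assumes each RE has already been compiled to a complete DFA over a shared alphabet (that compilation step is not shown, and it only exists for true regular expressions, so features such as backreferences are out of scope). The tuple layout and the two hand-built DFAs are assumptions made for the example.

```python
from collections import deque


def is_subset(dfa_a, dfa_b, alphabet):
    """Return True if every string accepted by dfa_a is accepted by dfa_b.

    Each DFA is a tuple (start, accepting, delta) where delta[state][symbol]
    is the next state and is defined for every symbol in `alphabet`.
    """
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    # Breadth-first search over the reachable states of the product automaton.
    while queue:
        a, b = queue.popleft()
        if a in accept_a and b not in accept_b:
            return False  # some string is accepted by A but rejected by B
        for sym in alphabet:
            nxt = (delta_a[a][sym], delta_b[b][sym])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def equivalent(dfa_a, dfa_b, alphabet):
    """Two DFAs are equivalent when inclusion holds in both directions."""
    return is_subset(dfa_a, dfa_b, alphabet) and is_subset(dfa_b, dfa_a, alphabet)


# Hand-built DFAs over the alphabet {a, b}, standing in for unanchored
# searches: "contains a" and "contains ab".
ALPHABET = "ab"
contains_a = (0, {1}, {0: {"a": 1, "b": 0}, 1: {"a": 1, "b": 1}})
contains_ab = (0, {2}, {0: {"a": 1, "b": 0},
                        1: {"a": 1, "b": 2},
                        2: {"a": 2, "b": 2}})

ab_implies_a = is_subset(contains_ab, contains_a, ALPHABET)
a_implies_ab = is_subset(contains_a, contains_ab, ALPHABET)
```

Running `is_subset` on every ordered pair gives the edges of G, and `equivalent` can serve as the check for new REs that duplicate an existing one. With under 100 REs the pairwise loop is small, though the DFAs themselves can grow large for some patterns.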
Comments (1)
I am not sure how much flexibility you have in choosing a regular expression library, but you could look at RE2, whose Set interface can match multiple regexes simultaneously. Note that RE2 uses primarily a DFA approach, and does not support all of the regex features that other, mostly backtracking, implementations do.