生成最短的正则表达式来匹配任意单词列表

发布于 2024-12-04 18:30:03 字数 216 浏览 1 评论 0原文

我希望有人可能知道一个脚本,它可以接受任意单词列表并生成可以完全匹配该列表的最短正则表达式(而不是其他)。

例如,假设我的列表是

1231
1233
1234
1236
1238
1247
1256
1258
1259

那么输出应该是:

12(3[13468]|47|5[589])

I'm hoping someone might know of a script that can take an arbitrary word list and generated the shortest regex that could match that list exactly (and nothing else).

For example, suppose my list is

1231
1233
1234
1236
1238
1247
1256
1258
1259

Then the output should be:

12(3[13468]|47|5[589])

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

南汐寒笙箫 2024-12-11 18:30:03

您最好保存整个列表,或者如果您想变得更奇特,请创建一个 Trie

1231
1234
1247

    1
    |
    2 
   / \
  3   4
 / \   \
1   4   7

现在,当您获取字符串时,检查它是否到达叶节点。确实如此,这是有效的。

如果您有可变长度的重叠字符串(例如:123 和 1234),您需要将一些节点标记为可能的终端。


如果您确实喜欢正则表达式的想法,您还可以使用 trie 生成正则表达式:

  1. 从根到第一个分支的节点是固定的(例如:12

  2. 分支创建|:(例如:12(3|4)代码>

  3. 叶子节点生成跟随父节点的字符类(或单个字符):(例如<代码>12(3[14]|47))

这可能不会生成最短的正则表达式,为此,您需要可能需要一些额外的工作:

  1. 如果找到它们,则为“紧凑”范围(例如,[12345] 变为 [1-4]

  2. 为重复元素添加量词(例如:[1234 ][1234] 变为 [1234]{2}

  3. ???

我真的不认为生成正则表达式值得。

You are probably better off saving the entire list, or if you want to get fancy, create a Trie:

1231
1234
1247

    1
    |
    2 
   / \
  3   4
 / \   \
1   4   7

Now when you take a string check if it reaches a leaf node. It does, it's valid.

If you have variable length overlapping strings (eg: 123 and 1234) you'll need to mark some nodes as possibly terminal.


You can also use the trie to generate the regex if you really like the regex idea:

  1. Nodes from the root to the first branching are fixed (eg: 12)

  2. Branches create |: (eg: 12(3|4)

  3. Leaf nodes generate a character class (or single character) that follows the parent node: (eg 12(3[14]|47))

This might not generate the shortest regex, to do that you'll might some extra work:

  1. "Compact" ranges if you find them (eg [12345] becomes [1-4])

  2. Add quantifiers for repeated elements (eg: [1234][1234] becomes [1234]{2}

  3. ???

I really don't think it's worth it to generate the regex.

梦巷 2024-12-11 18:30:03

这是一篇旧文章,但为了那些像我一样通过网络搜索找到它的人的利益,有一个 Perl 模块可以执行此操作,称为 Regexp::Optimizer,此处:http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm

它需要一个正则表达式作为输入,它可以只包含用 | 分隔的输入字符串列表,并输出最佳正则表达式。

例如,这个 Perl 命令行:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"

生成以下输出:(

(?^:(?^:12(?:3[13468]|5[689]|47)))

假设您已经安装了 Regex::Optimizer),它非常符合 OP 的期望。

这是另一个示例:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"

输出:

(?^:(?^:3(?:[1238]|57)4))

为了进行比较,基于 trie 的最佳版本将输出 3(14|24|34|574|84)。在上面的输出中,您还可以搜索并替换 (?:(?^:( 并消除多余的括号,以获得这:

3([1238]|57)4

This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm

It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.

For example, this Perl command-line:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"

generates this output:

(?^:(?^:12(?:3[13468]|5[689]|47)))

(assuming you have installed Regex::Optimizer), which matches the OP's expectation quite well.

Here's another example:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"

And the output:

(?^:(?^:3(?:[1238]|57)4))

For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:

3([1238]|57)4
留一抹残留的笑 2024-12-11 18:30:03

该项目从给定的单词列表生成正则表达式: https://github.com/bwagner/wordhierarchy

它几乎与上述 JavaScript 解决方案相同,但避免了某些多余的括号。
它仅使用“|”、非捕获组“(?:)”和选项“?”。

它最适合具有从左侧开始的公共子字符串的单词。可以改进以在任何地方处理子字符串。

生成的正则表达式可以轻松适应其他正则表达式方言。

示例用法:

java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:3[36148]|47|5[698])

我应该提到我是这个项目的作者。

This project generates a regexp from a given list of words: https://github.com/bwagner/wordhierarchy

It almost does the same as the above JavaScript solution, but avoids certain superfluous parentheses.
It only uses "|", non-capturing group "(?:)" and option "?".

It works best for words with common substrings from the left. Could be improved to work with substrings anywhere.

The generated regexp could easily be adapted to other regexp dialects.

Sample usage:

java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:3[36148]|47|5[698])

I should mention I'm the author of this project.

魂ガ小子 2024-12-11 18:30:03

这是我想到的(JavaScript)。它将 20,000 个 6 位数字的列表转换为 60,000 个字符的正则表达式。与简单的 (word1|word2|...) 结构相比,按字符数计算,这几乎是 60% 的“压缩”。

我将这个问题悬而未决,因为仍有很大的改进空间,并且我希望可能有更好的工具。

var list = new listChar("");

function listChar(s, p) {
    this.char = s;
    this.depth = 0;
    this.parent = p;
    this.add = function(n) {
        if (!this.subList) {
            this.subList = {};
            this.increaseDepth();
        }
        if (!this.subList[n]) {
            this.subList[n] = new listChar(n, this);
        }
        return this.subList[n];
    }
    this.toString = function() {
        var ret = "";
        var subVals = [];
        if (this.depth >=1) {
            for (var i in this.subList) {
                subVals[subVals.length] = this.subList[i].toString();
            }
        }
        if (this.depth === 1 && subVals.length > 1) {
            ret = "[" + subVals.join("") + "]";
        } else if (this.depth === 1 && subVals.length === 1) {
            ret = subVals[0];
        } else if (this.depth > 1) {
            ret = "(" + subVals.join("|") + ")";
        }
        return this.char + ret;
    }
    this.increaseDepth = function() {
        this.depth++;
        if (this.parent) {
            this.parent.increaseDepth();
        }
    }
}

function wordList(input) {
    var listStep = list;
    while (input.length > 0) {
        var c = input.charAt(0);
        listStep = listStep.add(c);
        input = input.substring(1);
    }
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
    wordList(words[i]);
}

document.write(list.toString());

使用

words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];

这里是输出:

(1(2(3[13468]|47|5[689])))

Here's what I came up with (JavaScript). It turned a list of 20,000 6-digit numbers into a 60,000-character regular expression. Compared to a naive (word1|word2|...) construction, that's almost 60% "compression" by character count.

I'm leaving the question open, as there's still a lot of room for improvement and I'm holding out hope that there might be a better tool out there.

var list = new listChar("");

function listChar(s, p) {
    this.char = s;
    this.depth = 0;
    this.parent = p;
    this.add = function(n) {
        if (!this.subList) {
            this.subList = {};
            this.increaseDepth();
        }
        if (!this.subList[n]) {
            this.subList[n] = new listChar(n, this);
        }
        return this.subList[n];
    }
    this.toString = function() {
        var ret = "";
        var subVals = [];
        if (this.depth >=1) {
            for (var i in this.subList) {
                subVals[subVals.length] = this.subList[i].toString();
            }
        }
        if (this.depth === 1 && subVals.length > 1) {
            ret = "[" + subVals.join("") + "]";
        } else if (this.depth === 1 && subVals.length === 1) {
            ret = subVals[0];
        } else if (this.depth > 1) {
            ret = "(" + subVals.join("|") + ")";
        }
        return this.char + ret;
    }
    this.increaseDepth = function() {
        this.depth++;
        if (this.parent) {
            this.parent.increaseDepth();
        }
    }
}

function wordList(input) {
    var listStep = list;
    while (input.length > 0) {
        var c = input.charAt(0);
        listStep = listStep.add(c);
        input = input.substring(1);
    }
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
    wordList(words[i]);
}

document.write(list.toString());

Using

words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];

Here's the output:

(1(2(3[13468]|47|5[689])))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文