如何从给定的字符串列表自动生成正则表达式?

发布于 2024-10-15 18:19:49 字数 432 浏览 7 评论 0原文

给您 2 个字符串列表 - A 和 B。找到与 A 中的所有字符串匹配且 B 中没有字符串匹配的最短正则表达式。请注意,此正则表达式可以匹配/不匹配不在 A 和 B 中的其他字符串。简单起见,我们可以假设我们的字母表大小只有 2 个字符 - 0 和 1。而且只允许使用以下运算符:

* - 0 或更多
? - 0 或 1
+ - 1 个或多个
() - 括号

为简单起见,不允许使用正则表达式 not 运算符。我不知道允许或运算符 (|) 是否会简化问题。 A 和 B 当然没有共同元素。以下是一些示例:

A=[00,01,10]
B=[11]
answer = 1*0+1*


A=[00,01,11]
B=[10]
answer = 0*1*

You are given 2 lists of Strings - A and B. Find the shortest regex that matches all strings in A and none in B. Note that this regex can match/not-match other strings that are not in A and not in B. For simplicity, we can assume the that our alphabet size is just 2 characters - 0 and 1. Also only these operators are allowed:

* - 0 or more
? - 0 or 1
+ - 1 or more
() - brackets

For simplicity the regex not operator is not allowed. I don't know if allowing the or operator (|) would simplify the problem or not. A and B ofcourse would have no common elements. Here are some examples:

A=[00,01,10]
B=[11]
answer = 1*0+1*

A=[00,01,11]
B=[10]
answer = 0*1*

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

三寸金莲 2024-10-22 18:19:49

解决这个问题的一种方法是使用遗传算法。我碰巧有一个遗传求解器,所以我应用了它使用以下算法解决您的问题:

  • 时,从所需的输入中获取不同的标记
  • 基因
  • 当基因将正则表达式特殊项添加到适应度算法的
    • 确保生成的字符串是有效的正则表达式
    • 根据它匹配的所需事物的数量来获取适合度值,并且
      它匹配了多少不需要的东西
  • 在找到成功的正则表达式之前,
    • 从不同标记的数量开始,并根据需要递增
    • 尝试生成满足适应性要求的长度的正则表达式

这是我在 C# 中的实现

private static void GenerateRegex(IEnumerable<string> target, IEnumerable<string> dontMatch)
{
    string distinctSymbols = new String(target.SelectMany(x => x).Distinct().ToArray());
    string genes = distinctSymbols + "?*()+";

    Func<string, uint> calcFitness = str =>
        {
            if (str.Count(x => x == '(') != str.Count(x => x == ')'))
            {
                return Int32.MaxValue;
            }
            if ("?*+".Any(x => str[0] == x))
            {
                return Int32.MaxValue;
            }
            if ("?*+?*+".ToArray().Permute(2)
                .Any(permutation => str.IndexOf(new string(permutation.ToArray())) != -1))
            {
                return Int32.MaxValue;
            }
            Regex regex;
            try
            {
                regex = new Regex("^" + str + "$");
            }
            catch (Exception)
            {
                return Int32.MaxValue;
            }
            uint fitness = target.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 0U : 1));
            uint nonFitness = dontMatch.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 10U : 0));
            return fitness + nonFitness;
        };

    for (int targetGeneLength = distinctSymbols.Length; targetGeneLength < genes.Length * 2; targetGeneLength++)
    {
        string best = new GeneticSolver(50).GetBestGenetically(targetGeneLength, genes, calcFitness, true);
        if (calcFitness(best) != 0)
        {
            Console.WriteLine("-- not solved with regex of length " + targetGeneLength);
            continue;
        }
        Console.WriteLine("solved with: " + best);
        break;
    }
}

及其应用于示例的结果:

public void Given_Sample_A()
{
    var target = new[] { "00", "01", "10" };
    var dontMatch = new[] { "11" };

    GenerateRegex(target, dontMatch);
}

输出:

Generation  1 best: 10 (2)
Generation  2 best: 0+ (2)
Generation  5 best: 0* (2)
Generation  8 best: 00 (2)
Generation  9 best: 01 (2)
-- not solved with regex of length 2
Generation  1 best: 10* (2)
Generation  3 best: 00* (2)
Generation  4 best: 01+ (2)
Generation  6 best: 10+ (2)
Generation  9 best: 00? (2)
Generation 11 best: 00+ (2)
Generation 14 best: 0?1 (2)
Generation 21 best: 0*0 (2)
Generation 37 best: 1?0 (2)
Generation 43 best: 10? (2)
Generation 68 best: 01* (2)
Generation 78 best: 1*0 (2)
Generation 79 best: 0*1 (2)
Generation 84 best: 0?0 (2)
Generation 127 best: 01? (2)
Generation 142 best: 0+1 (2)
Generation 146 best: 0+0 (2)
Generation 171 best: 1+0 (2)
-- not solved with regex of length 3
Generation  1 best: 1*0+ (1)
Generation  2 best: 0+1* (1)
Generation 20 best: 1?0+ (1)
Generation 31 best: 1?0* (1)
-- not solved with regex of length 4
Generation  1 best: 1*00? (1)
Generation  2 best: 0*1?0 (1)
Generation  3 best: 1?0?0 (1)
Generation  4 best: 1?00? (1)
Generation  8 best: 1?00* (1)
Generation 12 best: 1*0?0 (1)
Generation 13 best: 1*00* (1)
Generation 41 best: 0*10* (1)
Generation 44 best: 1*0*0 (1)
-- not solved with regex of length 5
Generation  1 best: 0+(1)? (1)
Generation 36 best: 0+()1? (1)
Generation 39 best: 0+(1?) (1)
Generation 61 best: 1*0+1? (0)
solved with: 1*0+1?

第二个示例:

public void Given_Sample_B()
{
    var target = new[] { "00", "01", "11" };
    var dontMatch = new[] { "10" };

    GenerateRegex(target, dontMatch);
}

输出:

Generation  1 best: 00 (2)
Generation  2 best: 01 (2)
Generation  7 best: 0* (2)
Generation 12 best: 0+ (2)
Generation 33 best: 1+ (2)
Generation 36 best: 1* (2)
Generation 53 best: 11 (2)
-- not solved with regex of length 2
Generation  1 best: 00* (2)
Generation  2 best: 0+0 (2)
Generation  7 best: 0+1 (2)
Generation 12 best: 00? (2)
Generation 15 best: 01* (2)
Generation 16 best: 0*0 (2)
Generation 19 best: 01+ (2)
Generation 30 best: 0?0 (2)
Generation 32 best: 0*1 (2)
Generation 42 best: 11* (2)
Generation 43 best: 1+1 (2)
Generation 44 best: 00+ (2)
Generation 87 best: 01? (2)
Generation 96 best: 0?1 (2)
Generation 125 best: 11? (2)
Generation 126 best: 1?1 (2)
Generation 135 best: 11+ (2)
Generation 149 best: 1*1 (2)
-- not solved with regex of length 3
Generation  1 best: 0*1* (0)
solved with: 0*1*

One way to solve this is with a genetic algorithm. I happen to have a genetic solver laying around so I applied it to your problem with the following algorithm:

  • get the distinct tokens from the desired inputs as genes
  • add the Regex specials to the genes
  • for the fitness algorithm
    • make sure the generated string is a valid regex
    • get a fitness value based on how many desired things it matches and
      how many undesired things it matches
  • until a successful Regex is found
    • starting at the number of distinct tokens and incrementing as necessary
    • try to generate a Regex of that length that passes the fitness requirement

Here's my implementation in C#

private static void GenerateRegex(IEnumerable<string> target, IEnumerable<string> dontMatch)
{
    string distinctSymbols = new String(target.SelectMany(x => x).Distinct().ToArray());
    string genes = distinctSymbols + "?*()+";

    Func<string, uint> calcFitness = str =>
        {
            if (str.Count(x => x == '(') != str.Count(x => x == ')'))
            {
                return Int32.MaxValue;
            }
            if ("?*+".Any(x => str[0] == x))
            {
                return Int32.MaxValue;
            }
            if ("?*+?*+".ToArray().Permute(2)
                .Any(permutation => str.IndexOf(new string(permutation.ToArray())) != -1))
            {
                return Int32.MaxValue;
            }
            Regex regex;
            try
            {
                regex = new Regex("^" + str + "$");
            }
            catch (Exception)
            {
                return Int32.MaxValue;
            }
            uint fitness = target.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 0U : 1));
            uint nonFitness = dontMatch.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 10U : 0));
            return fitness + nonFitness;
        };

    for (int targetGeneLength = distinctSymbols.Length; targetGeneLength < genes.Length * 2; targetGeneLength++)
    {
        string best = new GeneticSolver(50).GetBestGenetically(targetGeneLength, genes, calcFitness, true);
        if (calcFitness(best) != 0)
        {
            Console.WriteLine("-- not solved with regex of length " + targetGeneLength);
            continue;
        }
        Console.WriteLine("solved with: " + best);
        break;
    }
}

And the result of its application to your samples:

public void Given_Sample_A()
{
    var target = new[] { "00", "01", "10" };
    var dontMatch = new[] { "11" };

    GenerateRegex(target, dontMatch);
}

output:

Generation  1 best: 10 (2)
Generation  2 best: 0+ (2)
Generation  5 best: 0* (2)
Generation  8 best: 00 (2)
Generation  9 best: 01 (2)
-- not solved with regex of length 2
Generation  1 best: 10* (2)
Generation  3 best: 00* (2)
Generation  4 best: 01+ (2)
Generation  6 best: 10+ (2)
Generation  9 best: 00? (2)
Generation 11 best: 00+ (2)
Generation 14 best: 0?1 (2)
Generation 21 best: 0*0 (2)
Generation 37 best: 1?0 (2)
Generation 43 best: 10? (2)
Generation 68 best: 01* (2)
Generation 78 best: 1*0 (2)
Generation 79 best: 0*1 (2)
Generation 84 best: 0?0 (2)
Generation 127 best: 01? (2)
Generation 142 best: 0+1 (2)
Generation 146 best: 0+0 (2)
Generation 171 best: 1+0 (2)
-- not solved with regex of length 3
Generation  1 best: 1*0+ (1)
Generation  2 best: 0+1* (1)
Generation 20 best: 1?0+ (1)
Generation 31 best: 1?0* (1)
-- not solved with regex of length 4
Generation  1 best: 1*00? (1)
Generation  2 best: 0*1?0 (1)
Generation  3 best: 1?0?0 (1)
Generation  4 best: 1?00? (1)
Generation  8 best: 1?00* (1)
Generation 12 best: 1*0?0 (1)
Generation 13 best: 1*00* (1)
Generation 41 best: 0*10* (1)
Generation 44 best: 1*0*0 (1)
-- not solved with regex of length 5
Generation  1 best: 0+(1)? (1)
Generation 36 best: 0+()1? (1)
Generation 39 best: 0+(1?) (1)
Generation 61 best: 1*0+1? (0)
solved with: 1*0+1?

second sample:

public void Given_Sample_B()
{
    var target = new[] { "00", "01", "11" };
    var dontMatch = new[] { "10" };

    GenerateRegex(target, dontMatch);
}

output:

Generation  1 best: 00 (2)
Generation  2 best: 01 (2)
Generation  7 best: 0* (2)
Generation 12 best: 0+ (2)
Generation 33 best: 1+ (2)
Generation 36 best: 1* (2)
Generation 53 best: 11 (2)
-- not solved with regex of length 2
Generation  1 best: 00* (2)
Generation  2 best: 0+0 (2)
Generation  7 best: 0+1 (2)
Generation 12 best: 00? (2)
Generation 15 best: 01* (2)
Generation 16 best: 0*0 (2)
Generation 19 best: 01+ (2)
Generation 30 best: 0?0 (2)
Generation 32 best: 0*1 (2)
Generation 42 best: 11* (2)
Generation 43 best: 1+1 (2)
Generation 44 best: 00+ (2)
Generation 87 best: 01? (2)
Generation 96 best: 0?1 (2)
Generation 125 best: 11? (2)
Generation 126 best: 1?1 (2)
Generation 135 best: 11+ (2)
Generation 149 best: 1*1 (2)
-- not solved with regex of length 3
Generation  1 best: 0*1* (0)
solved with: 0*1*
放肆 2024-10-22 18:19:49

如果这是一个家庭作业问题,那就像是“一份家庭作业,全班得 A”类型。
我认为该问题中的某处缺少“或”运算符。

有一个明显的解决方案是 A0|A1|A2|...,但在尝试找到最短的时似乎更难。

我建议使用递归来尝试缩短正则表达式,但这不是一个理想的解决方案。

If this was a homework problem, it would be like "one homework, get an A in the class" type.
I think there is "or" operator missing somewhere in that question.

There is an obvious solution that is A0|A1|A2|..., but seems like much harder solution when trying to find the shortest.

I would suggest using recursion to try to shorten the regex, but that is not an ideal solution.

演多会厌 2024-10-22 18:19:49

该项目从给定的单词列表生成一个正则表达式:
https://github.com/bwagner/wordhierarchy

但是,它只使用“|”、非捕获组“(?:)”和选项“?”。

使用示例:

java -jar dist/wordhierarchy.jar 00 01 10
-> 10|0(?:1|0)

java -jar dist/wordhierarchy.jar 00 01 11
-> 0(?:0|1)|11

java -jar dist/wordhierarchy.jar 000 001 010 011 100 101 110 111
-> 1(?:0(?:0|1)|1(?:0|1))|0(?:1(?:1|0)|0(?:1|0))

java -jar dist/wordhierarchy.jar 000 001 010     100 101 110 111
-> 1(?:0(?:0|1)|1(?:1|0))|0(?:10|0(?:1|0))

This project generates a regexp from a given list of words:
https://github.com/bwagner/wordhierarchy

However, it only uses "|", non-capturing group "(?:)" and option "?".

Sample usage:

java -jar dist/wordhierarchy.jar 00 01 10
-> 10|0(?:1|0)

java -jar dist/wordhierarchy.jar 00 01 11
-> 0(?:0|1)|11

java -jar dist/wordhierarchy.jar 000 001 010 011 100 101 110 111
-> 1(?:0(?:0|1)|1(?:0|1))|0(?:1(?:1|0)|0(?:1|0))

java -jar dist/wordhierarchy.jar 000 001 010     100 101 110 111
-> 1(?:0(?:0|1)|1(?:1|0))|0(?:10|0(?:1|0))
抠脚大汉 2024-10-22 18:19:49

“当有疑问时,请使用暴力。”

import re

def learn(ayes, noes, max_size=7):
    def is_ok(rx):
        rx += '

这会产生与第一个不同但同样好的答案:0*1?0*。它查看 1241 个试验正则表达式来解决两个测试用例(总计)。

搜索有大小限制——因为这个问题的通用正则表达式版本是 NP 困难的,任何针对它的程序都会在足够复杂的输入上遇到麻烦。我承认自己没有真正考虑过这个简化的问题。我很想看到一些简洁的不太明显的答案。

return (all(re.match(rx, s) for s in ayes) and not any(re.match(rx, s) for s in noes)) return find(find(gen_sized(size), is_ok) for size in range(max_size + 1)) def find(xs, ok=lambda x: x): for x in xs: if ok(x): return x def gen_sized(size): if 0 == size: yield '' if 0 < size: for rx in gen_sized(size-1): yield rx + '0' yield rx + '1' if rx and rx[-1] not in '*?+': yield rx + '*' yield rx + '?' yield rx + '+' if 5 < size: for rx in gen_sized(size-3): yield '(%s)*' % rx yield '(%s)?' % rx yield '(%s)+' % rx

这会产生与第一个不同但同样好的答案:0*1?0*。它查看 1241 个试验正则表达式来解决两个测试用例(总计)。

搜索有大小限制——因为这个问题的通用正则表达式版本是 NP 困难的,任何针对它的程序都会在足够复杂的输入上遇到麻烦。我承认自己没有真正考虑过这个简化的问题。我很想看到一些简洁的不太明显的答案。

"When in doubt, use brute force."

import re

def learn(ayes, noes, max_size=7):
    def is_ok(rx):
        rx += '

This produces a different but equally good answer for the first one: 0*1?0*. It looks at 1241 trial regexes to solve the two test cases (total).

The search has a size limit -- since the general-regex version of this problem is NP-hard, any program for it is going to run into trouble on complex-enough inputs. I'll cop to not having really thought about this simplified problem. I'd love to see some neat less-obvious answers.

return (all(re.match(rx, s) for s in ayes) and not any(re.match(rx, s) for s in noes)) return find(find(gen_sized(size), is_ok) for size in range(max_size + 1)) def find(xs, ok=lambda x: x): for x in xs: if ok(x): return x def gen_sized(size): if 0 == size: yield '' if 0 < size: for rx in gen_sized(size-1): yield rx + '0' yield rx + '1' if rx and rx[-1] not in '*?+': yield rx + '*' yield rx + '?' yield rx + '+' if 5 < size: for rx in gen_sized(size-3): yield '(%s)*' % rx yield '(%s)?' % rx yield '(%s)+' % rx

This produces a different but equally good answer for the first one: 0*1?0*. It looks at 1241 trial regexes to solve the two test cases (total).

The search has a size limit -- since the general-regex version of this problem is NP-hard, any program for it is going to run into trouble on complex-enough inputs. I'll cop to not having really thought about this simplified problem. I'd love to see some neat less-obvious answers.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文