Regex to match any subset of a given set?

Posted 2024-12-01 14:36:31

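Is it possible to write a regular expression that matches any subset of a given set of characters
a1 ... an?
That is, it should match any string in which each of these characters appears at most once, with no other characters, and where the relative order of the characters does not matter.

Some approaches that come to mind immediately:
1. [a1,...,an]* or (a1|a2|...|an)* - this allows a character to appear more than once.
2. (a1?a2?...an?) - no repeated characters, but the relative order matters - this matches any subsequence, but not any subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. writing out all possible subsequences (just hardcoding every matching string :)) - of course, this is not acceptable.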
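
The difference between these three attempts can be checked on a small concrete set. The sketch below assumes the set {a, b, c} stands in for a1 ... an and uses Python's re module; the names ALLOWED and enumerated_pattern are made up for illustration.

```python
import re
from itertools import permutations

ALLOWED = "abc"  # stand-in for a1 ... an (assumption for illustration)

# Attempt 1: a character class with a star also accepts repeated characters.
any_of = re.compile("[" + re.escape(ALLOWED) + "]*")

# Attempt 2: optional characters in a fixed order accept only subsequences.
in_order = re.compile("".join(re.escape(c) + "?" for c in ALLOWED))

# Attempt 3: spell out every ordering of every subset (grows factorially).
def enumerated_pattern(chars):
    words = [
        "".join(re.escape(c) for c in perm)
        for size in range(len(chars) + 1)
        for perm in permutations(chars, size)
    ]
    return re.compile("|".join(words))

enumerated = enumerated_pattern(ALLOWED)

# Columns: string, attempt 1, attempt 2, attempt 3 (whole-string matches).
for s in ["", "ac", "ca", "aa", "abd"]:
    print(repr(s),
          bool(any_of.fullmatch(s)),
          bool(in_order.fullmatch(s)),
          bool(enumerated.fullmatch(s)))
```

For three characters the enumerated pattern already has 16 alternatives, which is why the third attempt does not scale.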

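I also have a hunch that this may be theoretically impossible, because while parsing the string we would need to remember which characters we have already encountered, and as far as I know regular expressions can only check right-linear languages.

Any help would be appreciated. Thanks in advance.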

失而复得 2024-12-08 14:36:31

This doesn't really qualify for the language-agnostic tag, but...

^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$

see a demo on ideone.com

The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.

This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.

分开我的手 2024-12-08 14:36:31

I can't think how to do it with a single regex, but this is one way to do it with n regexes (I will use 1 2 ... m n etc. for your a's):

^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$

If all of the above match, your string is a subset of 12..mn (each character appears at most once, in any order).

How this works: each line requires the string to consist exactly of:

any number of characters drawn from the set, except a particular one
optionally, that particular one
any number of characters drawn from the set, except that particular one

If this passes when every element in turn is taken as the particular one, we know:

there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements

as required.
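
As a rough sketch of the n-regex idea, the patterns can be generated rather than written by hand. This assumes Python's re module and that the allowed characters are distinct; per_char_patterns and is_subset_string are hypothetical helper names, not part of the answer.

```python
import re

def per_char_patterns(allowed):
    """Build one pattern per allowed character: others*, char?, others*."""
    patterns = []
    for c in allowed:
        others = "".join(re.escape(o) for o in allowed if o != c)
        # A character class cannot be empty, so handle a one-element set.
        rest = "[" + others + "]*" if others else ""
        patterns.append(re.compile(rest + "(?:" + re.escape(c) + ")?" + rest))
    return patterns

def is_subset_string(s, allowed):
    """True if every per-character pattern matches the whole string."""
    if not allowed:
        # With no allowed characters only the empty string qualifies.
        return s == ""
    return all(p.fullmatch(s) for p in per_char_patterns(allowed))

print(is_subset_string("cab", "abc"))   # every character used once
print(is_subset_string("abca", "abc"))  # 'a' appears twice
print(is_subset_string("abd", "abc"))   # 'd' is not allowed
```

fullmatch plays the role of the ^ and $ anchors in the patterns above.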

For completeness, I should say that I would only do this if I were under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.

故事灯 2024-12-08 14:36:31

Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.

You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.

In pseudo-code:

foreach char a in {a1, ..., an}
   hit[a] = false

foreach char c in string
   if c not in {a1, ..., an} => fail
   if hit[c] => fail
   hit[c] = true
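
The pseudo-code translates almost line for line into a real language. A minimal Python sketch, where the function name uses_each_at_most_once is made up for illustration and a set replaces the hit table:

```python
def uses_each_at_most_once(s, allowed):
    """Return True if s uses only allowed characters, each at most once."""
    allowed = set(allowed)
    seen = set()  # characters of s encountered so far
    for c in s:
        # Bail out on a character outside the set or on a repeat.
        if c not in allowed or c in seen:
            return False
        seen.add(c)
    return True

print(uses_each_at_most_once("cab", "abc"))
print(uses_each_at_most_once("abca", "abc"))
```

This runs in a single pass over the string and needs no regex support at all.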

肩上的翅膀 2024-12-08 14:36:31

Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:

#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
    print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}

We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".

If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
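
As one example of how the flavor matters, here is a sketch of the same pattern in Python's re module, generalized to an arbitrary non-empty set. In Python, $ also matches just before a trailing newline and . does not match a newline by default, so this version uses fullmatch instead of anchors and sets re.DOTALL. The helper name subset_pattern is hypothetical.

```python
import re

def subset_pattern(allowed):
    """Compile the block pattern for a non-empty set of allowed characters."""
    char_class = "[" + "".join(re.escape(c) for c in allowed) + "]"
    # Each block: one allowed character that does not occur again later.
    # DOTALL lets '.' cross newlines inside the lookahead.
    return re.compile(r"(?:(" + char_class + r")(?!.*\1))*", re.DOTALL)

pattern = subset_pattern("abc")
for s in ["ba", "pabc", "abac", "a", "cc", "cba", "abcd", "abbbbc", "", "ab\n"]:
    # fullmatch requires the whole string, including any trailing newline.
    print(repr(s), "matches" if pattern.fullmatch(s) else "does not match")
```

With the anchored form and re.match, the string "ab\n" would be accepted, because $ matches before the final newline; fullmatch rejects it.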

Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for "is N characters long and consists of these characters":

my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
    print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}
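
The same AND trick can be sketched in Python under the assumption of a non-empty set of distinct characters; permutation_pattern is a made-up helper name, and \Z is used inside the lookahead so that the no-repeats check has to reach the very end of the string.

```python
import re

def permutation_pattern(chars):
    """Compile a pattern matching exactly the permutations of chars."""
    char_class = "[" + "".join(re.escape(c) for c in chars) + "]"
    # Lookahead: the whole string consists of non-repeating allowed characters.
    no_repeats = r"(?=(?:(" + char_class + r")(?!.*\1))*\Z)"
    # Main part: exactly len(chars) characters from the set.
    return re.compile(no_repeats + char_class + "{%d}" % len(chars), re.DOTALL)

perm = permutation_pattern("abc")
for s in ["ba", "abac", "abcd", "abc", "acb", "bca", "aab", ""]:
    print(repr(s), "matches" if perm.fullmatch(s) else "does not match")
```

The lookahead rules out repeats and foreign characters, and the length check rules out proper subsets, so only full permutations remain.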
