Regular expression to match any subset of a given set?
Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an?
I.e. it should match any string in which each of these characters appears at most once, there are no other characters, and the relative order of the characters doesn't matter.
Some approaches that come to mind at once:
1. [a1...an]* or (a1|a2|...|an)* - this allows a character to appear more than once.
2. (a1?a2?...an?) - no repeated characters, but the relative order matters: this matches any subsequence, not any subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write out all possible orderings of all subsets (just hardcode all matching strings :)). Of course, this is not acceptable.
I also have a guess that it may be theoretically impossible, because while parsing the string we need to remember which characters we have already met, and as far as I know regular expressions can recognize only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but... see a demo on ideone.com.
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never succeed again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
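The "check-off" mechanism described in the answer above can be seen in isolation with a small Python sketch. This is not the regex from that answer (which did not survive here, and which relies on forward references that, as far as I know, Python's standard `re` module does not accept); it only illustrates how an empty group that has participated in the match makes `(?!\1)` fail. The pattern and the name `checked` are made up for this illustration.

```python
import re

# Hypothetical pattern: group 1 is an empty group that only
# participates in the match when the 'a' branch is taken.
checked = re.compile(r"(?:a()|b)(?!\1)c")

# 'b' branch: group 1 never participated, so \1 cannot match,
# the negative lookahead succeeds, and 'c' is then matched.
took_b = checked.fullmatch("bc") is not None

# 'a' branch: group 1 captured an empty string, so \1 matches
# (emptily) and the negative lookahead fails.
took_a = checked.fullmatch("ac") is not None

print(took_b, took_a)
```

In a flavor that does accept forward references, the same lookahead can be placed in front of each element so that a second occurrence of that element is rejected.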
Can't think how to do it with a single regex, but this is one way to do it with n regexes: (I will use 1 2 ... m n etc. for your a's)
If all the above match, your string is a subset of 12..mn.
How this works: each line requires the string to consist exactly of three parts, each defined in terms of "a particular one" of the elements.
If this passes when every element in turn is considered as "a particular one", we know the string meets the requirements.
For completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
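The regexes for this n-regex approach are missing above, so the following Python sketch is a guess at the idea rather than the answerer's actual patterns. It assumes one pattern per allowed character, of the form "any allowed characters except this one, then this one optionally, then any allowed characters except this one". The function names are hypothetical.

```python
import re

def subset_patterns(allowed):
    """Build one pattern per allowed character (assumed reconstruction)."""
    patterns = []
    for ch in allowed:
        # Character class of every allowed character except `ch`.
        others = "".join(re.escape(o) for o in allowed if o != ch)
        rest = "[" + others + "]*" if others else ""
        patterns.append(re.compile(rest + re.escape(ch) + "?" + rest))
    return patterns

def is_subset_n_regexes(s, allowed):
    """True if every per-character pattern matches the whole string."""
    return all(p.fullmatch(s) for p in subset_patterns(allowed))

print(is_subset_n_regexes("cba", "abc"), is_subset_n_regexes("abca", "abc"))
```

Each pattern on its own limits one character to at most one occurrence and excludes foreign characters; only the conjunction of all n patterns enforces the limit for every character.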
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
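The pseudo-code itself is missing here; a minimal Python sketch of the traversal described above could look like this. The function name `is_subset_by_scan` is made up for the example, and `allowed` is assumed to be any container of single characters.

```python
def is_subset_by_scan(s, allowed):
    """Check that s uses only allowed characters, each at most once."""
    allowed = set(allowed)
    seen = set()
    for ch in s:
        if ch not in allowed:
            return False  # character outside the allowed set: bail out
        if ch in seen:
            return False  # allowed, but already seen: bail out
        seen.add(ch)
    return True

print(is_subset_by_scan("cab", "abc"), is_subset_by_scan("abca", "abc"))
```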
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
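The regex for this answer did not survive, so here is a Python sketch of a pattern that fits the description: a repeated block that captures one character from the set and then asserts, via a negative lookahead on `\1`, that the same character does not occur later. The exact pattern, the use of `fullmatch` instead of `^`/`$`, and the name `is_subset_regex` are assumptions for illustration. A non-empty `allowed` is assumed.

```python
import re

def is_subset_regex(s, allowed):
    """Single-regex check: each allowed character at most once, nothing else."""
    # Character class built from the allowed characters, e.g. [abc].
    cls = "[" + "".join(re.escape(c) for c in allowed) + "]"
    # Each block: one character from the set (group 1) that is not
    # followed anywhere later in the string by that same character.
    pattern = re.compile(r"(?:(" + cls + r")(?!.*\1))*", re.DOTALL)
    return pattern.fullmatch(s) is not None

print(is_subset_regex("cab", "abc"), is_subset_regex("aba", "abc"))
```

`re.DOTALL` is used so that `.` in the lookahead also spans newlines, which matters only if the string can contain them.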
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for "is N characters long and consists of these characters":
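As a sketch of that combination in Python (again an assumed reconstruction, since the original regex is missing): the positive lookahead asserts the no-repeats pattern over the whole string, and the main pattern then requires exactly N characters from the set. The name `is_permutation_regex` is hypothetical, and a non-empty `allowed` with distinct characters is assumed.

```python
import re

def is_permutation_regex(s, allowed):
    """True if s is a permutation of the distinct characters in `allowed`."""
    cls = "[" + "".join(re.escape(c) for c in allowed) + "]"
    # First condition, inside a lookahead: no character repeats.
    no_repeats = r"(?:(" + cls + r")(?!.*\1))*\Z"
    # Second condition: exactly len(allowed) characters from the set.
    exact_length = cls + "{" + str(len(allowed)) + "}"
    pattern = re.compile("(?=" + no_repeats + ")" + exact_length, re.DOTALL)
    return pattern.fullmatch(s) is not None

print(is_permutation_regex("cab", "abc"), is_permutation_regex("aab", "abc"))
```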