Regular expression for repeating sequences

Posted on 2024-12-21 19:24:54 | Word count: 518 | Views: 0 | Comments: 0

I'd like to match three-character sequences of letters (only the letters 'a', 'b' and 'c' are allowed) separated by commas (the last group is not followed by a comma).

Examples:

abc,bca,cbb
ccc,abc,aab,baa
bcb

I have written the following regular expression:

re.match('([abc][abc][abc],)+', "abc,defx,df")

However, it doesn't work correctly, as these two checks show:

>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df")))  # 'defx' in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df")))  # 'x' in first group
False

It seems to check only the first group of three letters and to ignore the rest. How do I write this regular expression correctly?
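
To see what the pattern from the question actually consumes, here is a minimal sketch (Python 3 is assumed, and the variable names are illustrative, not part of the original post). It prints the prefix that re.match accepted and contrasts it with re.fullmatch, which has to consume the whole string:

```python
import re

pattern = r'([abc][abc][abc],)+'
sample = "abc,defx,df"

m = re.match(pattern, sample)
print(m.group(0))   # the part of the string the pattern consumed
print(m.end())      # index where the match stopped

# re.fullmatch succeeds only if the pattern covers the entire string
print(re.fullmatch(pattern, sample))
```

The match stops right after the first "abc," and nothing in the pattern requires it to go further, which is why the result is truthy.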

烟若柳尘 2024-12-28 19:24:54

Try the following regex:

^[abc]{3}(,[abc]{3})*$

^...$ from the start to the end of the string
[...] any one of the listed characters
...{3} exactly three repetitions of the preceding element
(...)* zero or more repetitions of the group in parentheses
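
As a rough illustration of how this pattern behaves, the sketch below wraps it in a helper (is_valid_triples is a hypothetical name chosen for this example) and runs it against the strings from the question plus a few edge cases:

```python
import re

# The pattern from the answer above, compiled once for reuse.
TRIPLES = re.compile(r'^[abc]{3}(,[abc]{3})*$')

def is_valid_triples(text):
    """Return True if text is a comma-separated list of abc-triples."""
    return TRIPLES.match(text) is not None

for text in ("abc,bca,cbb", "ccc,abc,aab,baa", "bcb",
             "abc,defx,df", "abc,", ""):
    print(repr(text), is_valid_triples(text))
```

The first three strings are accepted; the last three are rejected because of a foreign letter, a trailing comma, and a missing triple respectively.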

遗失的美好 2024-12-28 19:24:54

Your regex asks for "at least one triple of the letters a, b, c, each followed by a comma"; that is what "+" gives you. Whatever follows after that doesn't matter to the regex. You might want to include "$", which means "end of the line", to make sure the whole line consists of allowed triples. However, in its current form your regex would also demand that the last triple end in a comma, so you should explicitly code that it doesn't.
Try this:

re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,bca,cbb")

This matches any number (possibly zero) of allowed triples each followed by a comma, then a triple without a comma, then the end of the line.

Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
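
One detail worth knowing about "$" in Python: it also matches just before a newline at the very end of the string. The sketch below (Python 3.4+ assumed for re.fullmatch; the constant names are illustrative) compares the anchored pattern with re.fullmatch on the same input:

```python
import re

# The pattern from the answer, with and without the trailing "$".
ANCHORED = r'([abc][abc][abc],)*([abc][abc][abc])$'
UNANCHORED = r'([abc][abc][abc],)*([abc][abc][abc])'

ok = bool(re.match(ANCHORED, "abc,bca,cbb"))
bad = bool(re.match(ANCHORED, "abc,defx,df"))
# "$" tolerates a single trailing newline ...
newline_dollar = bool(re.match(ANCHORED, "abc,bca\n"))
# ... while re.fullmatch must consume every character, newline included.
newline_full = bool(re.fullmatch(UNANCHORED, "abc,bca\n"))

print(ok, bad, newline_dollar, newline_full)
```

If input lines may carry a trailing newline that should be rejected, re.fullmatch (or "\Z" instead of "$") is the stricter choice.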

焚却相思 2024-12-28 19:24:54

The obligatory "you don't need a regex" solution:

all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
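
For illustration, the same expression can be wrapped in a small function (is_valid is a hypothetical name, not from the answer) so that the two conditions are easier to read and test:

```python
def is_valid(data):
    """Check a comma-separated list of abc-triples without using a regex."""
    # every character is one of the allowed letters or a comma
    only_allowed = all(letter in 'abc,' for letter in data)
    # every comma-separated item has exactly three characters
    all_triples = all(len(item) == 3 for item in data.split(','))
    return only_allowed and all_triples

for data in ("abc,bca,cbb", "abc,defx,df", "abc,", "", "abcabc"):
    print(repr(data), is_valid(data))
```

The empty string and a trailing comma are rejected because split(',') then yields an empty item whose length is not 3.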
给不了的爱 2024-12-28 19:24:54

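You need to iterate over the sequence of found values.

import re

data_string = "abc,bca,df"

imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)

for match in imatch:
    print(match.group('value'))

So a regex that checks whether the whole string matches the pattern will be

data_string = "abc,bca,df"

# The trailing $ is required: without it the match stops after "abc,bca,"
# and still succeeds. Note that this form also accepts a trailing comma ("abc,").
match = re.match(r'^([abc]{3}(,|$))+$', data_string)

if match:
    print("data string is correct")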
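
A possible way to combine both steps, sketched under the assumption that malformed input should be reported rather than silently skipped (parse_triples is a hypothetical helper; Python 3.4+ for fullmatch): validate the whole string first, then split it.

```python
import re

TRIPLES = re.compile(r'[abc]{3}(?:,[abc]{3})*')

def parse_triples(data_string):
    """Return the list of triples, or raise ValueError if the string is malformed."""
    if TRIPLES.fullmatch(data_string) is None:
        raise ValueError("not a comma-separated list of abc-triples: %r" % data_string)
    return data_string.split(',')

# finditer on its own simply skips the invalid tail "df" instead of reporting it
found = [m.group('value')
         for m in re.finditer(r'(?P<value>[abc]{3})(,|$)', "abc,bca,df")]
print(found)
print(parse_triples("abc,bca,cbb"))
```

Iterating with finditer is fine for extraction, but it cannot tell you that part of the input was ignored, which is why the validation step comes first here.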
请你别敷衍 2024-12-28 19:24:54

Your result is not surprising, since the regular expression

([abc][abc][abc],)+

only has to match three characters from [abc] followed by a comma, one or more times; nothing in it says what may come after the last repetition. So the most important part is to make sure that there is nothing else in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.

天邊彩虹 2024-12-28 19:24:54

To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).

For example:

  • (?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
  • (?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.

Here, you can use

^[abc]{3}(?:,[abc]{3})*$
          ^^

Note that using a capturing group has unwelcome side effects in a lot of Python regex methods. See the classic issue described in the "re.findall behaves weird" post, for example: re.findall, and every other regex method that uses it behind the scenes, returns only the captured substrings if there is a capturing group in the pattern.

In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
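
The re.findall effect mentioned above can be reproduced with the two patterns from this thread. This is a minimal sketch using only the standard re module (the variable names are illustrative):

```python
import re

text = "abc,bca,cbb"

# Capturing group: findall returns what the group captured last, not the whole match.
with_capture = re.findall(r'^[abc]{3}(,[abc]{3})*$', text)

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r'^[abc]{3}(?:,[abc]{3})*$', text)

print(with_capture)
print(without_capture)
```

Both patterns accept exactly the same strings; they differ only in what the matching functions hand back.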

一身骄傲 2024-12-28 19:24:54

An alternative without using regex (albeit a brute-force one):

>>> import itertools
>>> def matcher(x):
...     # all 27 possible three-letter strings over 'a', 'b', 'c'
...     total = ["".join(p) for p in itertools.product(('a','b','c'), repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
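
A slightly tighter variant of the same brute-force idea, offered only as a sketch: the allowed triples are built once as a set (VALID_TRIPLES and matcher_set are hypothetical names), so each lookup no longer scans a list and the table is not rebuilt on every call.

```python
import itertools

# All 27 three-letter strings over 'a', 'b', 'c', built once at import time.
VALID_TRIPLES = {"".join(p) for p in itertools.product("abc", repeat=3)}

def matcher_set(x):
    """Return True if every comma-separated item of x is an allowed triple."""
    return all(item in VALID_TRIPLES for item in x.split(','))

print(matcher_set("abc,bca,aaa"))
print(matcher_set("abc,bca,xyz"))
print(matcher_set("abc,aaa,bb"))
```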
初见你 2024-12-28 19:24:54

If your aim is to validate a string as being composed of triplets of the letters a, b and c:

import re

for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, '   ', bool(re.match(r'([abc]{3},?)+\Z', ss)))

result

abc,bbc,abb,baa,bbb     True
acc     True
abc,bbc,abb,bXa,bbb     False
abc,bbc,ab,baa,bbb     False

\Z means the end of the string. Its presence forces the match to extend to the very end of the string.

By the way, I like Sonya's form too; in a way it is clearer:

bool(re.match(r'([abc]{3},)*[abc]{3}\Z', ss))
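
The two patterns in this answer are not quite equivalent. Because the comma in the first one is optional, it also accepts triples that run together and a trailing comma. The sketch below (LENIENT and STRICT are illustrative names) runs both on a few edge cases:

```python
import re

LENIENT = r'([abc]{3},?)+\Z'        # first pattern: comma is optional
STRICT = r'([abc]{3},)*[abc]{3}\Z'  # second pattern: comma required between triples

samples = ("abc,bca", "abcbca", "abc,", "abc\n")

# Map each sample to (lenient result, strict result).
results = {ss: (bool(re.match(LENIENT, ss)), bool(re.match(STRICT, ss)))
           for ss in samples}

for ss, (lenient, strict) in results.items():
    print(repr(ss), lenient, strict)
```

Only the second form rejects "abcbca" and "abc,". Both reject "abc\n", since \Z, unlike "$", does not match before a final newline.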