Need help finding the right regex matching pattern

Posted 2024-12-12 18:03:59

I can't find a working regex in python to split these strings:

CAT One | desired: CAT

DOG SILVER FOX Two | desired: DOG SILVER FOX

KING KONG | desired: KING KONG

P'OT THEN Mark First | desired: P'OT THEN

Just stupid examples, but I need to separate words that are fully uppercase from words that are only capitalized.

I could have {1,n} uppercase words followed by {0,n} capitalized words.

My regexes were too weird: I either catch the whole string or only one uppercase word.

夜吻♂芭芘 2024-12-19 18:03:59

import re

lines = [
    "CAT One",
    "DOG SILVER FOX Two",
    " KING KONG ",
    "P'OT THEN Mark First",
    "FOO-BAR Second FISH",
    "horsE YELLOW thirD BLUE",
    ]

for line in lines:
    print(re.findall(r'\b[A-Z]+(?:\W*[A-Z]+)*\b', line))

Output:

['CAT']
['DOG SILVER FOX']
['KING KONG']
["P'OT THEN"]
['FOO-BAR', 'FISH']
['YELLOW', 'BLUE']

Explanation:

\b[A-Z]+ means: match one or more capital letters, but only at the start of a word. This will match "YELLOW", but not the "E" in "horsE".

\W*[A-Z]+ means: match zero or more non-word characters, followed by one or more capital letters. This will match "'OT" or "-BAR" or " KONG".

(?:\W*[A-Z]+)*\b means: make a (non-capturing) group which matches zero or more times, but only at the end of a word. This will match " SILVER FOX", but not the " T" which follows it.
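
If only the leading run of uppercase words is wanted, as in the question's examples, one option is to anchor the same pattern at the start of the line with re.match instead of scanning with re.findall. This is only a sketch for Python 3: the names LEADING_UPPER and leading_uppercase are made up for illustration, and it assumes plain ASCII A-Z exactly as in the pattern above.

```python
import re

# Same pattern as above, with optional leading whitespace skipped and the
# uppercase run captured in group 1. re.match only tries position 0.
LEADING_UPPER = re.compile(r"\s*([A-Z]+(?:\W*[A-Z]+)*)\b")


def leading_uppercase(line):
    """Return the run of all-uppercase words at the start of line, or ''."""
    m = LEADING_UPPER.match(line)
    return m.group(1) if m else ""


if __name__ == "__main__":
    for line in ["CAT One", "DOG SILVER FOX Two", " KING KONG ",
                 "P'OT THEN Mark First", "FOO-BAR Second FISH",
                 "horsE YELLOW thirD BLUE"]:
        print(repr(leading_uppercase(line)))
```

Unlike the findall version, this returns "FOO-BAR" for "FOO-BAR Second FISH" and an empty string for "horsE YELLOW thirD BLUE", because later uppercase words are never considered.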

怎会甘心 2024-12-19 18:03:59

A non-regex solution:

tests = """\
CAT One
DOG SILVER FOX Two
KING KONG
P'OT THEN Mark First
""".splitlines()

isAllUppercase = lambda s: all(c.upper() == c for c in s)

from itertools import takewhile

for t in tests:
    print(t)
    print(' '.join(takewhile(isAllUppercase, t.split())))
    print()

Gives:

CAT One
CAT

DOG SILVER FOX Two
DOG SILVER FOX

KING KONG
KING KONG

P'OT THEN Mark First
P'OT THEN
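
One thing to be aware of: the c.upper() == c check also accepts words with no letters at all (for example "123"), since upper() leaves them unchanged. If that matters, a possible variant is to use the built-in str.isupper, which requires at least one cased character. A minimal Python 3 sketch, where the function name leading_uppercase_words is just an illustration:

```python
from itertools import takewhile


def leading_uppercase_words(line):
    """Join the words at the start of line for which str.isupper() is true."""
    # takewhile stops at the first word that is not all uppercase,
    # so later uppercase words are ignored.
    return " ".join(takewhile(str.isupper, line.split()))


if __name__ == "__main__":
    for t in ["CAT One", "DOG SILVER FOX Two", "KING KONG",
              "P'OT THEN Mark First"]:
        print(leading_uppercase_words(t))
```

Words like P'OT still count as uppercase here, because isupper() ignores characters that have no case.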

我三岁 2024-12-19 18:03:59

[^a-z ](?![a-z])| (?![A-Z]?[a-z])

A character that is neither a lowercase letter nor a space, and that is not followed by a lowercase letter (so uppercase letters plus digits plus symbols)

or

a space not followed by an (optional uppercase letter and a) lowercase letter.

It isn't clear if you should pre-pend a ^ because the upper case words are always first.

^[^a-z ](?![a-z])| (?![A-Z]?[a-z])

(We are ignoring the case of a space as the first character here, so no (space)KING KONG. If you want to include it, put a ^ after the | as well, like ^ (?![A-Z]?[a-z]).)
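
The answer does not say how the pattern is meant to be applied. Since each alternative matches a single character, one plausible way to use the unanchored version in Python 3 is to collect the matching characters with findall and join them. This is a sketch under that assumption; CHAR_PATTERN and uppercase_part are made-up names.

```python
import re

# The unanchored pattern from the answer: each match is exactly one character.
CHAR_PATTERN = re.compile(r"[^a-z ](?![a-z])| (?![A-Z]?[a-z])")


def uppercase_part(line):
    """Join every character matched by CHAR_PATTERN, trimming outer spaces."""
    return "".join(CHAR_PATTERN.findall(line)).strip()


if __name__ == "__main__":
    for line in ["CAT One", "DOG SILVER FOX Two", "KING KONG",
                 "P'OT THEN Mark First"]:
        print(uppercase_part(line))
```

Used this way, the pattern keeps every uppercase word in the line, not only the leading ones: "FOO-BAR Second FISH" comes out as "FOO-BAR FISH".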
