Need help finding the right regex pattern

Posted 2024-12-12 18:03:59 | 353 characters | 2 views | 0 comments

I can't find a working regex in Python to split these strings:

CAT One | desired: CAT

DOG SILVER FOX Two | desired: DOG SILVER FOX

KING KONG | desired: KING KONG

P'OT THEN Mark First | desired: P'OT THEN

These are just silly examples, but I need to separate words that are fully uppercase from words that are only capitalized.

There can be 1 to n uppercase words, followed by 0 to n capitalized words.

My regexes were too weird: I either match the whole string or only one uppercase word.

Comments (4)

夜吻♂芭芘 2024-12-19 18:03:59
import re

lines = [
    "CAT One",
    "DOG SILVER FOX Two",
    " KING KONG ",
    "P'OT THEN Mark First",
    "FOO-BAR Second FISH",
    "horsE YELLOW thirD BLUE",
    ]

for line in lines:
    print(re.findall(r'\b[A-Z]+(?:\W*[A-Z]+)*\b', line))

Output:

['CAT']
['DOG SILVER FOX']
['KING KONG']
["P'OT THEN"]
['FOO-BAR', 'FISH']
['YELLOW', 'BLUE']

Explanation:

\b[A-Z]+ means: match one or more capital letters, but only at the start of a word. This will match "YELLOW", but not the "E" in "horsE".

\W*[A-Z]+ means: match zero or more non-word characters, followed by one or more capital letters. This will match "'OT" or "-BAR" or " KONG".

(?:\W*[A-Z]+)*\b means: make a (non-capturing) group which matches zero or more times, but only at the end of a word. This will match " SILVER FOX", but not the " T" which follows it.
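
If only the leading run of uppercase words is wanted as a single string (the "desired" column in the question), the same pattern can probably be anchored at the start of the line with re.match. This is a minimal sketch, not part of the original answer: the helper name leading_uppercase is made up for illustration, and it assumes that a line starting with a lowercase or capitalized word should yield an empty string.

```python
import re

# Same pattern as in the answer, wrapped in a capturing group and preceded
# by \s* so that leading whitespace (as in " KING KONG ") is skipped.
LEADING_UPPER = re.compile(r"\s*([A-Z]+(?:\W*[A-Z]+)*)\b")

def leading_uppercase(line):
    """Return the run of all-uppercase words at the start of line, or ''."""
    m = LEADING_UPPER.match(line)  # match() only tries position 0
    return m.group(1) if m else ""

for line in ["CAT One", "DOG SILVER FOX Two", " KING KONG ", "P'OT THEN Mark First"]:
    print(repr(leading_uppercase(line)))
```

Unlike re.findall, this ignores uppercase words that appear later in the line, such as the "FISH" in "FOO-BAR Second FISH".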

怎会甘心 2024-12-19 18:03:59

A non-regex solution:

tests = """\
CAT One
DOG SILVER FOX Two
KING KONG
P'OT THEN Mark First
""".splitlines()

isAllUppercase = lambda s: all(c.upper() == c for c in s)

from itertools import takewhile

for t in tests:
    print(t)
    print(' '.join(takewhile(isAllUppercase, t.split())))
    print()

Gives:

CAT One
CAT

DOG SILVER FOX Two
DOG SILVER FOX

KING KONG
KING KONG

P'OT THEN Mark First
P'OT THEN
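
Note that the predicate c.upper() == c is also true for digits and punctuation, so a token such as 42 would count as uppercase. If that is not wanted, str.isupper is a stricter alternative: it requires at least one cased character and no lowercase ones. A minimal sketch under that assumption (the function name uppercase_prefix is hypothetical):

```python
from itertools import takewhile

def uppercase_prefix(line):
    """Join the leading whitespace-separated tokens for which str.isupper() is true."""
    return " ".join(takewhile(str.isupper, line.split()))

print(uppercase_prefix("P'OT THEN Mark First"))
print(repr(uppercase_prefix("42 CATS One")))  # "42" has no cased characters
```

Which predicate fits better depends on whether purely numeric or symbolic tokens should be treated as part of the uppercase run.
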
我三岁 2024-12-19 18:03:59
[^a-z ](?![a-z])| (?![A-Z]?[a-z])

A character that is neither a lowercase letter nor a space, not followed by a lowercase letter (so uppercase letters, plus digits and symbols)

OR

a space not followed by an optional uppercase letter and then a lowercase letter.

It isn't clear whether you should prepend a ^, since the uppercase words always come first.

^[^a-z ](?![a-z])| (?![A-Z]?[a-z])

(We are ignoring the case of a space as the first character here, so no (space)KING KONG. If you want to include it, put a ^ after the |, like ^ (?![A-Z]?[a-z]))
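
Each alternative in the unanchored pattern consumes a single character, so the individual matches have to be joined back together. One way it might be applied, as a sketch: joining the results of re.findall is an assumption about the intended usage rather than something stated in the answer, and keep_uppercase is a hypothetical name.

```python
import re

# Unanchored version of the pattern; every match is exactly one character.
PATTERN = re.compile(r"[^a-z ](?![a-z])| (?![A-Z]?[a-z])")

def keep_uppercase(line):
    """Concatenate every single-character match of PATTERN found in line."""
    return "".join(PATTERN.findall(line))

for line in ["CAT One", "DOG SILVER FOX Two", "KING KONG", "P'OT THEN Mark First"]:
    print(keep_uppercase(line))
```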

心如荒岛 2024-12-19 18:03:59

You should be able to solve this with a negative lookahead: scan for uppercase letters NOT followed by a lowercase letter.

[A-Z']+ ?[A-Z']+?(?![a-z])

[A-Z'] is the set of characters you are matching; if you need more punctuation than just ', simply add it to this set.
