I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am fairly new to regex, and I am having difficulty creating an expression even though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. It should also pick up words that contain a dash and report the whole word (the parts both before and after the dash). If numbers are present, they should be reported with the word as well.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'
However, this also picks up these unwanted words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to give me only words that have at least two capitalised letters. I understand that it will also "mistakenly" pick up words such as "Abc-Abc", but I don't believe there is a way to avoid these.
Comments (1)
If a lookahead is supported and you don't want to match a double hyphen (--), you might use:

\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b

Explanation

\b  A word boundary
(?=  Positive lookahead, assert that from the current position to the right is
(?:[a-z\d-]*[A-Z]){2}  Match 2 times: optionally the allowed characters, followed by an uppercase char A-Z
)  Close the lookahead
[A-Za-z\d]+  Match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)*  Optionally repeat - and 1+ times the allowed characters
\b  A word boundary

See a regex101 demo.
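As a quick check, the parts explained above can be joined into a single pattern and run against the examples from the question. This is only a sketch: the names ABBREVIATION and find_abbreviations are made up for illustration, and the pattern string is the listed parts concatenated in order.

```python
import re

# Pattern built by concatenating the parts described in the explanation.
ABBREVIATION = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)

def find_abbreviations(text):
    """Return candidate abbreviations found in text (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ABBREVIATION.findall(text)

wanted = "ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC"
unwanted = "A-bc, a-b-c"

print(find_abbreviations(wanted))
# ['ABC', 'AbC', 'ABc', 'A-ABC', 'a-ABC', 'ABC-a', 'ABC123', 'ABC-123', '123-ABC']
print(find_abbreviations(unwanted))
# []
```

Note that the lookahead can look past a double hyphen, so a string such as "AB--CD" still yields the partial matches "AB" and "CD" with this pattern.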
To also avoid matching when hyphens surround the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
See another regex demo.
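The exact pattern behind that second demo is not shown here, so the following is only one plausible way to add the lookarounds: a (?<!-) before the leading word boundary and a (?!-) after the trailing one. The placement and the names LOOSE and STRICT are assumptions made for this sketch.

```python
import re

# Lookahead-only pattern: two uppercase letters, optional single hyphens.
LOOSE = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)

# Assumed variant: same pattern, but the match may not be preceded
# or followed by a hyphen.
STRICT = re.compile(
    r"(?<!-)\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)

text = "ABC -ABC ABC- AB--CD A-ABC"

print(LOOSE.findall(text))
# ['ABC', 'ABC', 'ABC', 'AB', 'CD', 'A-ABC']
print(STRICT.findall(text))
# ['ABC', 'A-ABC']
```

With the lookarounds in place, words that touch a stray or doubled hyphen are dropped entirely instead of being reported in pieces.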