Finding abbreviations with regex

Posted on 2025-01-30 11:06:04

I am trying to create a regular expression that identifies possible abbreviations within a given string in Python. I am fairly new to regex, and although I believe this should be somewhat simple, I am having difficulty writing the expression. It should pick up words that contain two or more capital letters. It should also pick up words that contain a dash and report the whole word (both the part before and the part after the dash). If digits are present, they should be reported as part of the word too.

As such, it should pick up:

ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.

I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.

However, this also picks up these unwanted words:

A-bc, a-b-c

I believe the problem is that it looks for either multiple capital letters or dashes. I want it to return only words that have at least two capital letters. I understand that it will also "mistakenly" pick up words such as "Abc-Abc", but I don't believe there is a way to avoid these.

Comments (1)

离旧人 2025-02-06 11:06:04

If lookaheads are supported and you don't want to match a double hyphen (--), you might use:

\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b

Explanation

  • \b A word boundary
  • (?= Positive lookahead, assert that from the current location to the right is
    • (?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally any of the allowed characters, then an uppercase char A-Z
  • ) Close the lookahead
  • [A-Za-z\d]+ Match 1+ times the allowed characters without the hyphen
  • (?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
  • \b A word boundary

See a regex101 demo.
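
As a quick check, the pattern can be run against the examples from the question with Python's re module. This is only a minimal sketch: the names ABBREVIATION and find_abbreviations are made up for illustration, and the sample strings are the ones listed in the question.

```python
import re

# Pattern copied from the answer; the variable and function names are
# illustrative assumptions, not part of the original post.
ABBREVIATION = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)


def find_abbreviations(text):
    """Return every substring of `text` that the pattern matches."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ABBREVIATION.findall(text)


# Words the question wants to be reported.
should_match = "ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC."
# Words the question's own attempt picked up by mistake.
should_not_match = "A-bc, a-b-c"

print(find_abbreviations(should_match))
# ['ABC', 'AbC', 'ABc', 'A-ABC', 'a-ABC', 'ABC-a', 'ABC123', 'ABC-123', '123-ABC']
print(find_abbreviations(should_not_match))
# []
```

As the question anticipates, a word such as "Abc-Abc" is still reported, because it contains two uppercase letters.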

To also avoid a match when hyphens surround the characters, you can use negative lookarounds that assert there is no hyphen directly to the left or right.

\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)

See another regex demo.
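
The difference between the two patterns shows up on words that have a leading or trailing hyphen. The following sketch compares them on a made-up sample string; LOOSE, STRICT, and the sample text are assumptions for illustration only.

```python
import re

# First pattern from the answer (no hyphen lookarounds).
LOOSE = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)
# Second pattern from the answer: (?<!-) and (?!-) reject a match that is
# directly preceded or followed by a hyphen.
STRICT = re.compile(
    r"\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)

# Hypothetical sample: three words with stray hyphens, one clean word.
sample = "-ABC- ABC- -ABC ABC-DEF"

loose_hits = LOOSE.findall(sample)
strict_hits = STRICT.findall(sample)

print(loose_hits)   # ['ABC', 'ABC', 'ABC', 'ABC-DEF']
print(strict_hits)  # ['ABC-DEF']
```

Which variant is preferable depends on whether a word like "ABC-" in the input should still count as an abbreviation.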
