Regex to match only uppercase "words", with some exceptions
I have technical strings like the following:
"The thing P1 must connect to the J236 thing in the Foo position."
I would like to match, with a regular expression, only the all-uppercase words (here, `P1` and `J236`). The problem is that I don't want to match the first letter of the sentence when it is a one-letter word.
For example, in:
"A thing P1 must connect ..."
I want `P1` only, not `A` and `P1`. By doing that, I know I can miss a real "word" (as in "X must connect to Y"), but I can live with it.
Additionally, I don't want to match uppercase words if the sentence is all uppercase.
Example:
"THING P1 MUST CONNECT TO X2."
Of course, ideally I would like to match the technical words `P1` and `X2` here, but since they are "hidden" in the all-uppercase sentence and these technical words have no specific pattern, it's impossible. Again, I can live with it, because all-uppercase sentences are not so frequent in my files.
Thanks!
Comments (6)
To some extent, this is going to vary by the "flavour" of RegEx you're using. The following is based on .NET RegEx, which uses `\b` for word boundaries. In the last example, it also uses negative lookarounds, `(?<!)` and `(?!)`, as well as non-capturing parentheses, `(?:)`.
Basically, though, if the terms always contain at least one uppercase letter followed by at least one number, you can use
For all-uppercase and numbers (total must be 2 or more):
For all-uppercase and numbers, but starting with at least one letter:
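The patterns for the three cases above are not shown here. Going only by the prose descriptions, they could look something like the following Python sketch. The pattern strings and constant names are reconstructions for illustration, not the answer's original expressions (which targeted .NET).

```python
import re

# Assumed reconstructions from the descriptions above, not the original patterns.
# One or more uppercase letters followed by one or more digits (P1, J236).
LETTERS_THEN_DIGITS = re.compile(r"\b[A-Z]+[0-9]+\b")
# Any mix of uppercase letters and digits, at least two characters in total.
UPPER_OR_DIGITS_2PLUS = re.compile(r"\b[A-Z0-9]{2,}\b")
# Same mix, but the first character must be a letter.
LETTER_FIRST = re.compile(r"\b[A-Z][A-Z0-9]+\b")

sample = "A thing 42 connects P1 to USB"
for pattern in (LETTERS_THEN_DIGITS, UPPER_OR_DIGITS_2PLUS, LETTER_FIRST):
    print(pattern.pattern, pattern.findall(sample))
```

None of the three matches the leading one-letter word `A`, simply because each requires at least two characters; none of them handles all-uppercase lines.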
The granddaddy, to return items that have any combination of uppercase letters and numbers, but which are not single letters at the beginning of a line and which are not part of a line that is all uppercase:
Breakdown:
The regex starts with `(?:`. The `?:` signifies that, although what follows is in parentheses, I'm not interested in capturing the result. This is called "non-capturing parentheses." Here, I'm using the parentheses because I'm using alternation (see below).
Inside the non-capturing parens, I have two separate clauses separated by the pipe symbol `|`. This is alternation, like an "or": the regex can match the first expression or the second. The two cases here are "is this the first word of the line" or "everything else," because we have the special requirement of excluding one-letter words at the beginning of the line.
Now, let's look at each expression in the alternation.
The first expression is `(?<!^)[A-Z]\b`. The main clause here is `[A-Z]\b`, which is any one capital letter followed by a word boundary, which could be punctuation, whitespace, a line break, etc. The part before that is `(?<!^)`, which is a "negative lookbehind." This is a zero-width assertion, which means it doesn't "consume" characters as part of a match (not really important to understand here). The syntax for negative lookbehind in .NET is `(?<!x)`, where x is the expression that must not exist before our main clause. Here that expression is simply `^`, or start-of-line, so this side of the alternation translates as "any word consisting of a single, uppercase letter that is not at the beginning of the line."
Okay, so we're matching one-letter, uppercase words that are not at the beginning of the line. We still need to match words consisting of all numbers and uppercase letters.
That is handled by a relatively small portion of the second expression in the alternation: `\b[A-Z0-9]+\b`. The `\b`s represent word boundaries, and the `[A-Z0-9]+` matches one or more numbers and capital letters together.
The rest of the expression consists of other lookarounds.
`(?<!^[A-Z0-9 ]*)` is another negative lookbehind, where the expression is `^[A-Z0-9 ]*`. This means what precedes, from the start of the line, must not consist entirely of capital letters, numbers, and spaces.
The second lookaround is `(?![A-Z0-9 ]$)`, which is a negative lookahead. It is meant to ensure that what follows is not all capital letters and numbers, although as written the character class has no quantifier, so it only rejects a match that is followed by exactly one such character and then the end of the line.
So, altogether, we are capturing words of all capital letters and numbers, and excluding one-letter, uppercase characters from the start of the line and everything from lines that are all uppercase.
There is at least one weakness here in that the lookarounds in the second alternation expression act independently, so a sentence like "A P1 should connect to the J9" will match J9, but not P1, because everything before P1 is capitalized.
It is possible to get around this issue, but it would almost triple the length of the regex. Trying to do so much in a single regex is seldom, if ever, justified. You'll be better off breaking up the work either into multiple regexes or a combination of regex and standard string processing commands in your programming language of choice.
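One detail of the first alternative in the breakdown is worth checking by hand: `(?<!^)[A-Z]\b` has no leading `\b`, so on its own it can also match the last letter of a longer uppercase word. The sketch below uses Python's `re` purely for illustration (the answer targets .NET); the constant names are made up, and the second pattern is a suggested variant, not part of the original answer.

```python
import re

# The fragment exactly as given in the breakdown.
AS_DESCRIBED = re.compile(r"(?<!^)[A-Z]\b")
# Assumed variant: a leading \b restricts it to genuine one-letter words.
WITH_BOUNDARY = re.compile(r"(?<!^)\b[A-Z]\b")

line = "A thing P1 must connect USB to X here"
print(AS_DESCRIBED.findall(line))   # ['B', 'X']  (the B is the tail of USB)
print(WITH_BOUNDARY.findall(line))  # ['X']
```

In both cases the leading `A` is skipped, because the lookbehind rejects position 0.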
Maybe you can run this regex first to see if the line is all caps:
That will match only if it's a line like `THING P1 MUST CONNECT TO X2.`
Otherwise, you should be able to pull out the individual uppercase phrases with this:
That should match "P1" and "J236" in `The thing P1 must connect to the J236 thing in the Foo position.`
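The two expressions this answer refers to are not shown. A minimal Python sketch of the same two-step idea might look like this; both patterns and the function name `uppercase_tokens` are assumptions made for illustration, not the answer's own regexes.

```python
import re

# Assumption: a line counts as "all caps" when it contains no lowercase ASCII letter.
ALL_CAPS_LINE = re.compile(r"^[^a-z]*$")
# Assumption: an uppercase "phrase" is a whole word made of uppercase letters and digits.
UPPER_TOKEN = re.compile(r"\b[A-Z0-9]+\b")

def uppercase_tokens(line):
    """Step 1: give up on all-caps lines. Step 2: pull out the uppercase tokens."""
    if ALL_CAPS_LINE.match(line):
        return []
    return UPPER_TOKEN.findall(line)

print(uppercase_tokens("The thing P1 must connect to the J236 thing in the Foo position."))
print(uppercase_tokens("THING P1 MUST CONNECT TO X2."))
```

Note that this sketch does nothing about a one-letter word at the start of the sentence: `"A thing P1 must connect"` still yields both `A` and `P1`.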
Don't do things like `[A-Z]` or `[0-9]`. Use `\p{Lu}` and `\d` instead. Of course, this is valid for Perl-based regex flavours, which includes Java.
I would suggest that you don't make some huge regex. First split the text into sentences, then tokenize each one (split it into words). Use a regex to check each token/word. Skip the first token of the sentence. Check beforehand whether all tokens are uppercase and skip the whole sentence if so, or alter the regex in that case.
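The steps above can be sketched roughly as follows. Python's built-in `re` module does not support `\p{Lu}`, so this sketch uses `unicodedata` categories as a stand-in; the sentence splitter is deliberately naive, and the names `is_upper_token` and `technical_words` are hypothetical.

```python
import re
import unicodedata

def is_upper_token(token):
    """True if the token has only uppercase letters (Lu) and decimal digits (Nd),
    with at least one letter."""
    cats = [unicodedata.category(ch) for ch in token]
    return bool(cats) and all(c in ("Lu", "Nd") for c in cats) and "Lu" in cats

def technical_words(text):
    found = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"\w+", sentence)
        if not tokens:
            continue
        # Skip sentences that contain no lowercase letter at all (all uppercase).
        if not any(ch.islower() for ch in sentence):
            continue
        # Skip the first token of the sentence, as the answer suggests.
        found.extend(t for t in tokens[1:] if is_upper_token(t))
    return found

print(technical_words("A thing P1 must connect. THING P1 MUST CONNECT TO X2."))
```

Skipping the first token unconditionally is stricter than the question asks for: a technical word that opens a sentence, as in "P1 connects to J9.", is dropped too.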
Why do you need to do this in one monster-regex? You can use actual code to implement some of these rules, and doing so would be much easier to modify if those requirements change later.
For example:
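The answer's own example is not shown. As one possible illustration of the idea, here is a Python sketch in which a simple regex only proposes candidates and ordinary code applies the two special rules from the question; the function name, the candidate pattern, and the all-uppercase test are assumptions, not the original code.

```python
import re

# Candidate: a whole word of uppercase letters/digits containing at least one letter.
CANDIDATE = re.compile(r"\b[A-Z0-9]*[A-Z][A-Z0-9]*\b")

def find_technical_words(line):
    # Rule 2: ignore lines that are entirely uppercase.
    if line.upper() == line:
        return []
    words = []
    for match in CANDIDATE.finditer(line):
        word = match.group()
        # Rule 1: ignore a one-letter word that opens the line.
        is_first_word = line[:match.start()].strip() == ""
        if is_first_word and len(word) == 1:
            continue
        words.append(word)
    return words

print(find_technical_words("A thing P1 must connect to X"))
print(find_technical_words("THING P1 MUST CONNECT TO X2."))
```

Each rule is a separate, readable branch, so changing one later (for instance, allowing one-letter words at the start) means editing a single condition rather than reworking a lookaround.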
I'm not a regex guru by any means. But try:
I won't try for the bonus points of the whole upper case sentence. hehe
For the first case you propose you can use: '[[:blank:]]+[A-Z0-9]+[[:blank:]]+', for example:
echo "The thing P1 must connect to the J236 thing in the Foo position" | grep -oE '[[:blank:]]+[A-Z0-9]+[[:blank:]]+'
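One thing to be aware of with this pattern is that the surrounding blanks are part of each match, so a token at the end of the line, or one that directly follows another match, is missed. The Python sketch below mirrors the grep pattern (with `[ \t]` standing in for `[[:blank:]]`, which Python's `re` does not support) and compares it with a lookaround variant; the variant and the constant names are assumptions for illustration, not part of the answer.

```python
import re

# Equivalent of the grep -E pattern: blanks are consumed on both sides.
WITH_BLANKS = re.compile(r"[ \t]+[A-Z0-9]+[ \t]+")
# Assumed variant: require a blank (or the string edge) on each side without consuming it.
LOOKAROUND = re.compile(r"(?<![^ \t])[A-Z0-9]+(?![^ \t])")

text = "The thing P1 J236 must connect to X2"
print(WITH_BLANKS.findall(text))  # [' P1 ']
print(LOOKAROUND.findall(text))   # ['P1', 'J236', 'X2']
```

The lookaround variant would still pick up a one-letter word at the start of the line, and it skips tokens that are followed directly by punctuation.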
In the second case maybe you need to use something else and not a regex, maybe a script with a dictionary of technical words...
Cheers, Fernando