Extracting words with a regular expression
I have a string:
1/temperatoA,2/CelcieusB!23/33/44,55/66/77
and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression:
(\d+/(\w+),?)*!
but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB!' because of how the expression is applied: '1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to match as many of those as it can, it goes on and finds that the pattern is repeated in '2/CelcieusB' (the comma being optional). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.

Anytime you want to capture everything that fits a certain pattern in a string, it is usually best to use the global flag and assign the captures to an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.

When I do this:
I get this:
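As a rough illustration of the behavior described above, here is a minimal sketch in Python rather than the answer's Perl. It assumes the per-item pattern (\d+/(\w+)) for the global match, which is a guess consistent with the surrounding text, not the answer's original snippet. The first part shows that a repeated group keeps only its last repetition; the second part collects every match into a list.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# The asker's pattern: the group is repeated by '*', so each repetition
# overwrites the previous capture and only the last one survives.
single = re.search(r"(\d+/(\w+),?)*!", text)
whole_match = single.group(0)   # '1/temperatoA,2/CelcieusB!'
last_outer = single.group(1)    # '2/CelcieusB'
last_inner = single.group(2)    # 'CelcieusB'

# Match one item at a time and collect all matches instead
# (similar in spirit to Perl's /g in list context).
pairs = re.findall(r"(\d+/(\w+))", text)

print(whole_match)
print(last_outer, last_inner)
print(pairs)
# [('1/temperatoA', 'temperatoA'), ('2/CelcieusB', 'CelcieusB'),
#  ('23/33', '33'), ('55/66', '66')]
```

Note that the numeric items after the ! are picked up as well, since nothing in this pattern ties a match to the part of the string before the !.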
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so, coming after a slash, they comply with the expression.

So, if this is an issue, you can change your regex to qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be alphabetic, followed by any number of word characters. Then the dump looks like this:

And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.

However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array; you can then sort through the array to see if it contains the data you want.
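A minimal Python sketch of the narrowed pattern, under one stated assumption: Python's built-in re module does not support \p{Alpha}, so the class [^\W\d_] (a word character that is neither a digit nor an underscore) stands in for it here. This is an approximation for illustration, not the answer's Perl code.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# \d+/           one or more digits followed by a slash
# ([^\W\d_]\w*)  a letter, then any number of word characters (captured)
words = re.findall(r"\d+/([^\W\d_]\w*)", text)

print(words)  # ['temperatoA', 'CelcieusB']
```

Because the first character after the slash must be a letter, items such as 23/33 and 55/66 no longer match.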
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
With a Perl-compatible regex engine you can search for

(?<=\d/)\w+(?=.*!)

(?<=\d/) asserts that there is a digit and a slash before the start of the match.
\w+ matches the identifier. This allows for letters, digits and underscores. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string, i.e. the regex will fail once we have passed the !.

Depending on the language you're using, you might need to escape some of the characters in the regex. E.g., for use in C (with the PCRE library), you need to escape the backslashes, so the pattern is written as the string literal "(?<=\\d/)\\w+(?=.*!)".
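For comparison, a small Python sketch of the same lookaround approach. Python's re module accepts this fixed-width lookbehind, and a raw string avoids the doubled backslashes that a C string literal needs. The variable names are illustrative only.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# (?<=\d/)  preceded by a digit and a slash
# \w+       the identifier itself
# (?=.*!)   a '!' must still lie ahead in the string
pattern = re.compile(r"(?<=\d/)\w+(?=.*!)")
words = pattern.findall(text)

print(words)  # ['temperatoA', 'CelcieusB']
```

Since neither lookaround consumes characters, each match is just the word itself and no capture group is needed.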
Will this work?
I made the following assumptions...

[[:alpha:]] matches any alphabetic character.
\b matches a word boundary.
(?=.*!) came from Tim Pietzcker's post.