Extracting words with a regular expression

Posted on 2024-09-04 14:26:50

I have the string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I want to extract the words temperatoA and CelcieusB.

I have this regular expression, (\d+/(\w+),?)*!, but the only match I get is 1/temperatoA,2/CelcieusB!

Why?

一袭水袖舞倾城 2024-09-11 14:26:50

Your whole match is '1/temperatoA,2/CelcieusB!' because that is what the following expression matches:

qr{ (       # begin group 1
      \d+   # at least one digit
      /     # followed by a slash
     (\w+)  # group 2: at least one word character
      ,?    # maybe a comma
    )*      # ANY number of repetitions of this group
    !       # then a literal exclamation mark
}x;

'1/temperatoA,' fulfills capture #1 first, but because the group is quantified with *, the engine repeats it as many times as it can and finds the pattern again in '2/CelcieusB' (the comma being optional). So the whole match is what you said it is, but what you probably weren't expecting is that a repeated capture group keeps only its last iteration: '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB' and $2 reads 'CelcieusB'.

Anytime you want to capture everything that fits a certain pattern in a string, it is usually best to use the global flag (/g) and assign the captures to an array. An array, unlike a single scalar such as $1, can hold all of the values captured by group #1.
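
To see the last-iteration behaviour for yourself, here is a minimal Perl sketch that runs the asker's pattern and reports the whole match along with $1 and $2. The helper name last_iteration_captures is made up for this illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the asker's pattern once and returns
# the whole match followed by capture groups 1 and 2.
sub last_iteration_captures {
    my ($s) = @_;
    if ( $s =~ m{(\d+/(\w+),?)*!} ) {
        return ( $&, $1, $2 );
    }
    return;
}

my ( $whole, $g1, $g2 ) =
    last_iteration_captures('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');

print "whole: $whole\n";    # whole: 1/temperatoA,2/CelcieusB!
print "g1:    $g1\n";       # g1:    2/CelcieusB
print "g2:    $g2\n";       # g2:    CelcieusB
```

Only the second pass through the group is still visible afterwards, which is why a single match cannot hand back both words.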

When I do this:

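use strict;
use warnings;
use Data::Dumper;    # provides Dumper()
my $str   = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
# In list context, /g returns every capture group from every match.
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}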

I get this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB',
          '23/33',
          '33',
          '55/66',
          '66'
        ];

Now, I figure that's probably not what you expected. But digits are word characters too, so '33' and '66', coming after a slash, satisfy \w+.

So, if this is an issue, you can tighten your regex to qr{(\d+/(\p{Alpha}\w*))}, which requires the first character to be alphabetic, followed by any number of word characters. Then the dump looks like this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB'
        ];

And if you only want 'temperatoA' and 'CelcieusB', then you're capturing more than you need, and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.

However, the secret to capturing more than one chunk with a capture expression is to assign the match to an array; you can then sort through the array to see if it contains the data you want.
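
As a rough illustration of the array approach, the sketch below applies the final pattern with /g in list context. The helper name extract_words is hypothetical. Because the pattern has a single capture group, the returned list holds exactly one word per match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every word that follows "digits/"
# and starts with an alphabetic character.
sub extract_words {
    my ($s) = @_;
    my @words = $s =~ m{\d+/(\p{Alpha}\w*)}g;
    return @words;
}

my @words = extract_words('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');
print join( ', ', @words ), "\n";    # temperatoA, CelcieusB
```

The numeric fields after the exclamation mark drop out here only because they do not start with a letter. The pattern itself does not look at the exclamation mark.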

一身骄傲 2024-09-11 14:26:50

The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?

The expression you want is simply as follows:

(\w+)
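
For comparison, here is a small Perl sketch of what this pattern returns when it is applied globally to the sample string from the question. The helper name all_word_runs is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: collects every run of word characters.
sub all_word_runs {
    my ($s) = @_;
    my @runs = $s =~ /(\w+)/g;
    return @runs;
}

my @runs = all_word_runs('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');
print join( ' ', @runs ), "\n";
# prints: 1 temperatoA 2 CelcieusB 23 33 44 55 66 77
```

Digits count as word characters, so the numeric fields come back as well and would still need to be filtered out.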


故事与诗 2024-09-11 14:26:50

With a Perl-compatible regex engine you can search for

(?<=\d/)\w+(?=.*!)

(?<=\d/) asserts that a digit and a slash come immediately before the start of the match.

\w+ matches the identifier. This allows letters, digits and underscores. If you only want to allow letters, use [A-Za-z]+ instead.

(?=.*!) asserts that there is a ! somewhere ahead in the string, i.e. the regex will stop matching once we have passed the !.
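
One way to try the lookaround pattern is from Perl itself, where the regex needs no extra escaping. This is only a sketch against the sample string from the question, and words_before_bang is a name invented here. Since the pattern has no capture groups, /g in list context returns the matched text itself.

```perl
use strict;
use warnings;

# Hypothetical helper: every word preceded by "digit/" that still
# has a "!" somewhere later in the string.
sub words_before_bang {
    my ($s) = @_;
    my @found = $s =~ /(?<=\d\/)\w+(?=.*!)/g;
    return @found;
}

my @found = words_before_bang('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');
print join( ', ', @found ), "\n";    # temperatoA, CelcieusB
```

The fields after the exclamation mark are rejected by the lookahead, not by their content, so a numeric field before the ! would still match \w+.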

Depending on the language you're using, you might need to escape some of the characters in the regex.

For example, to use it in C (with the PCRE library), you need to escape the backslashes inside the string literal:

myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);


谈情不如逗狗 2024-09-11 14:26:50

Will this work?

/([[:alpha:]]\w+)\b(?=.*!)

I made the following assumptions...

A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.

[[:alpha:]] matches any alphabetic character.

The \b matches a word boundary.

And the (?=.*!) came from Tim Pietzcker's post.
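
The "some sort of loop" from the assumptions might look like this in Perl. It is a sketch, not the answerer's code, and slash_words is a hypothetical name. In scalar context, each /g match resumes where the previous one ended, so the loop visits one word per iteration.

```perl
use strict;
use warnings;

# Hypothetical helper: loops over successive matches and collects
# the captured word from each one.
sub slash_words {
    my ($s) = @_;
    my @words;
    while ( $s =~ m{/([[:alpha:]]\w+)\b(?=.*!)}g ) {
        push @words, $1;
    }
    return @words;
}

my @words = slash_words('1/temperatoA,2/CelcieusB!23/33/44,55/66/77');
print join( ', ', @words ), "\n";    # temperatoA, CelcieusB
```

Note that [[:alpha:]]\w+ needs at least two characters. Replacing \w+ with \w* would also admit single-letter words.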
