Regex syntax to select the second occurrence of characters

Posted 2025-02-07 02:39:42 · 254 characters · 1 view · 0 comments

I have a relatively simple problem but can't figure out the right syntax in RegEx.
I have multiple experiment names as strings in various formats, e.g. SEF001DT45 or BV004MF.

What I want to do is to select the second occurrence of two letters after a numeric value (DT and MF in this case).

I figured out that
[A-Z]{2}
solves my problem only halfway. How do I get the proper substrings?

Comments (5)

最舍不得你 2025-02-14 02:39:42

A possible solution, based on stringr::str_extract and lookaround:

library(stringr)

strings <- c("SEF001DT45", "BV004MF")

str_extract(strings, "(?<=\\d)[:upper:]{2}")

#> [1] "DT" "MF"
泅渡 2025-02-14 02:39:42

Base R:

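# Input data (defined first so the call below can run):
x <- c(
  'SEF001DT45',
  'BV004MF'
)

# Using capture groups: keep only the two word characters
# that follow two digits.
gsub(
  ".*\\d{2}(\\w{2}).*",
  "\\1",
  x
)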
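
One caveat with the pattern above, shown as a small sketch: \\w also matches digits, so on a name with a longer trailing number the greedy .* can settle on the digits instead of the letters. The third input below is a hypothetical name that does not appear in the question, and the stricter [A-Z]{2} group is one possible adjustment, not part of the original answer.

```r
# The third element is a hypothetical name with a longer trailing number.
y <- c("SEF001DT45", "BV004MF", "SEF001DT4567")

# \\w matches letters, digits and underscore, so the capture group
# can end up holding digits when the name ends in four or more of them.
loose <- gsub(".*\\d{2}(\\w{2}).*", "\\1", y)

# Restricting the group to uppercase ASCII letters avoids that.
strict <- gsub(".*\\d{2}([A-Z]{2}).*", "\\1", y)

print(loose)   # "DT" "MF" "67"
print(strict)  # "DT" "MF" "DT"
```

For the two names given in the question both patterns return the same result.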
臻嫒无言 2025-02-14 02:39:42

TLDR: Generally, you can get the second occurrence of a PATTERN using one of the following

sub('.*?PATTERN.*?(PATTERN).*', '\\1', x)
stringr::str_match(x, 'PATTERN.*?(PATTERN)')[,2]
regmatches(x, regexpr('PATTERN.*?\\KPATTERN', x, perl=TRUE))

Details

You can use

x <- c('SEF001DT45','BV004MF')
sub('.*?[A-Z]{2}.*?([A-Z]{2}).*', '\\1', x)
## => [1] "DT" "MF"

The point here is to match up to the second occurrence of the pattern, capture it, then match the rest of the string, and replace the whole match with the backreference to the capturing group.

Note that sub performs a single search-and-replace operation, which is fine here because the regex is written to match the whole string.

Details:

  • .*? - any zero or more chars as few as possible
  • [A-Z]{2} - two uppercase ASCII letters
  • .*? - any zero or more chars as few as possible
  • ([A-Z]{2}) - Group 1 (\1 refers to this group value): two uppercase ASCII letters
  • .* - any zero or more chars as many as possible.

You can achieve this with a simpler regex using stringr::str_match:

x <- c('SEF001DT45','BV004MF')
library(stringr)
results <- stringr::str_match(x, '[A-Z]{2}.*?([A-Z]{2})')
results[,2] ## Get Group 1 values

Or, with regmatches/regexpr in base R:

x <- c('SEF001DT45','BV004MF')
results <- regmatches(x, regexpr('[A-Z]{2}.*?\\K[A-Z]{2}', x, perl=TRUE))
results

Here, [A-Z]{2}.*?\\K[A-Z]{2} first matches two uppercase ASCII letters, then any zero or more chars (other than line break chars, since the PCRE engine is used) as few as possible. \K then discards the text matched so far, and the [A-Z]{2} at the end of the pattern matches the second occurrence of the two-letter chunk. regexpr only finds the first match in each string.
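
The regmatches/regexpr variant can be wrapped in a small helper, sketched below under some assumptions. The function name second_occurrence and the third input "AB12" are hypothetical and not part of the answer. The sketch also fills in NA for strings with no second occurrence, because regmatches on its own returns only the strings that matched, so the result can be shorter than the input.

```r
# Hypothetical helper: return the second occurrence of `pattern`
# in each element of `x`, or NA when there is none.
second_occurrence <- function(x, pattern) {
  # PATTERN.*?\KPATTERN, with each copy wrapped in a non-capturing group
  rx <- paste0("(?:", pattern, ").*?\\K(?:", pattern, ")")
  m <- regexpr(rx, x, perl = TRUE)
  hit <- as.integer(m) != -1L          # regexpr returns -1 for no match
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches keeps only the matches
  out
}

# "AB12" is a hypothetical name with only one two-letter chunk.
res <- second_occurrence(c("SEF001DT45", "BV004MF", "AB12"), "[A-Z]{2}")
print(res)  # "DT" "MF" NA
```

Keeping the output the same length as the input makes it easier to store the result as a column next to the original names.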

顾忌 2025-02-14 02:39:42

Maybe:

s <- c("SEF001DT45", "BV004MF")
sub("[A-Z]+\\d+([A-Z]{2}).*", "\\1", s)
#sub("[A-Z]+[0-9]+([A-Z]{2}).*", "\\1", s) #Alternative
#[1] "DT" "MF"

Here [A-Z]+ matches one or more uppercase letters, \\d+ one or more digits, [A-Z]{2} exactly two uppercase letters, and .* the rest of the string.
The parentheses () capture the part that \\1 inserts as the replacement.
Or, to be stricter about the second occurrence of two letters:

sub(".*?[A-Z]{2}[0-9]+([A-Z]{2}).*", "\\1", s)
#[1] "DT" "MF"

If it is enough to extract the two letters that directly follow a digit:

regmatches(s, regexpr("(?<=\\d)[A-Z]{2}", s, perl=TRUE))
#[1] "DT" "MF"
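
The variants above agree on the two names from the question but can differ on other formats. The sketch below uses a hypothetical name, "ABCD12EF", that is not in the question, and passes perl = TRUE to every call (an assumption made here so that all three patterns run on the same engine).

```r
z <- "ABCD12EF"  # hypothetical name starting with four letters

# Second occurrence of any two uppercase letters: the pair right after "AB".
second_pair <- sub(".*?[A-Z]{2}.*?([A-Z]{2}).*", "\\1", z, perl = TRUE)

# Two letters, then digits, then two letters: the pair after the number.
after_digits <- sub(".*?[A-Z]{2}[0-9]+([A-Z]{2}).*", "\\1", z, perl = TRUE)

# Lookbehind: two uppercase letters directly preceded by a digit.
after_digit <- regmatches(z, regexpr("(?<=\\d)[A-Z]{2}", z, perl = TRUE))

print(second_pair)   # "CD"
print(after_digits)  # "EF"
print(after_digit)   # "EF"
```

Which variant fits depends on whether "second occurrence" means the second letter pair anywhere in the name or the letter pair that follows the number.
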
回梦 2025-02-14 02:39:42

Another base R trick is strsplit, splitting on the digit runs and taking the second piece:

> sapply(strsplit(s, split = "\\d+"), `[[`, 2)
[1] "DT" "MF"

or gsub, capturing the non-digit run that follows the first digits:

> gsub("^.*?(?<=\\d)(\\D+).*", "\\1", s, perl = TRUE)
[1] "DT" "MF"