Splitting a comma- and semicolon-separated string in R

Posted on 2025-01-23 14:03:56

I'm trying to split a string that contains two entries, each of which has a specific format:

  • Category (e.g. active site/region) which is followed by a :
  • Term (e.g. His, Glu/nucleotide-binding motif A) which is followed by a ,

Here's the string that I want to split:

string <- "active site: His, Glu,region: nucleotide-binding motif A,"

This is what I have tried so far. Except for the two empty substrings, it produces the desired output.

unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))

[1] "active site: His, Glu"              ""                                   "region: nucleotide-binding motif A"
[4] "" 

How do I get rid of the empty substrings?

Comments (2)

苍白女子 2025-01-30 14:03:56

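You get the empty strings because .*? can also match an empty string at any position where the lookahead (?=,(?:\\w+|$)) succeeds, that is, right before each comma that is followed by a word character or by the end of the string.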
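
If you would rather keep the original pattern, one fallback is to drop the zero-length matches after extraction. The sketch below is only an illustration: it uses base R's regmatches()/gregexpr() with perl = TRUE as a stand-in for str_extract_all() (an assumption, made to avoid the stringr dependency), and the names pattern, matches and entries are made up for this example.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Original pattern from the question; besides the two entries it can also
# match the empty string right before a comma that satisfies the lookahead.
pattern <- ".*?(?=,(?:\\w+|$))"
matches <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]

# nzchar() is TRUE only for non-empty strings, so this drops the empties.
entries <- matches[nzchar(matches)]
print(entries)
```

This treats the symptom rather than the cause; tightening the pattern so that it cannot match an empty string avoids the extra filtering step.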

You can rule out the empty matches by requiring each match to start with one or more characters that are neither a colon nor a comma (a negated character class), followed by the : itself:

[^:,\n]+:.*?(?=,(?:\w|$))

Explanation

  • [^:,\n]+ Match 1+ chars other than : , or a newline
  • : Match the colon
  • .*? Match any char, as few times as possible
  • (?= Positive lookahead, assert that what is directly to the right from the current position:
    • , Match literally
    • (?:\w|$) Match either a single word char, or assert the end of the string
  • ) Close the lookahead
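
The pattern explained above can also be tried without stringr. This is a minimal sketch assuming that base R's gregexpr() with perl = TRUE (PCRE) treats the pattern the same way str_extract_all() does for this input; the helper name extract_entries is hypothetical.

```r
# Hypothetical helper: extract "category: term" entries with base R only.
extract_entries <- function(x) {
  # Start with 1+ chars that are not ':' ',' or newline, then ':',
  # then as few chars as possible up to a comma that is followed by a
  # word character or by the end of the string.
  pattern <- "[^:,\\n]+:.*?(?=,(?:\\w|$))"
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

string <- "active site: His, Glu,region: nucleotide-binding motif A,"
entries <- extract_entries(string)
print(entries)
```

Because every match must contain at least one non-delimiter character and a colon, the pattern can no longer match an empty string.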

library(stringr)

string <- "active site: His, Glu,region: nucleotide-binding motif A,"
unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))

Output

[1] "active site: His, Glu"              "region: nucleotide-binding motif A"
沉鱼一梦 2025-01-30 14:03:56

Much longer and not as elegant as @The fourth bird +1,
but it works:

library(stringr)

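string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Splitting on the first "xxx, yyy," chunk leaves the second entry as element 2
string2 <- strsplit(string, "([^,]+,[^,]+),", perl = TRUE)[[1]][2]
# Removing the second entry from the original leaves the first entry
string1 <- str_replace(string, string2, "")
# Strip the trailing comma from both entries
string <- str_replace_all(c(string1, string2), '\\,$', '')
> string
[1] "active site: His, Glu"             
[2] "region: nucleotide-binding motif A"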
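
For comparison, a shorter split-based variant is sketched below. It is not taken from either answer; it assumes that an entry-separating comma is always followed directly by a word character or by the end of the string, while commas inside a term are followed by a space (as in "His, Glu"). The name parts is illustrative.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Split only on commas that are followed by a word character or by the
# end of the string; ", " inside a term is left alone.
parts <- strsplit(string, ",(?=\\w|$)", perl = TRUE)[[1]]
print(parts)
```

strsplit() does not return a trailing empty element when the separator matches at the very end of the input, so no extra clean-up is needed here. The approach breaks if a term contains a comma with no space after it.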