Split a comma- and semicolon-separated string in R

Posted on 2025-01-23 14:03:56

I'm trying to split a string containing two entries and each entry has a specific format:

Category (e.g. active site/region) which is followed by a :
Term (e.g. His, Glu/nucleotide-binding motif A) which is followed by a ,

Here's the string that I want to split:

string <- "active site: His, Glu,region: nucleotide-binding motif A,"

This is what I have tried so far. Except for the two empty substrings, it produces the desired output.

library(stringr)
unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))

[1] "active site: His, Glu"              ""                                   "region: nucleotide-binding motif A"
[4] ""

How do I get rid of the empty substrings?

苍白女子 2025-01-30 14:03:56

You get the empty strings because .*? can also match an empty string at any position where the lookahead (?=,(?:\w+|$)) is true, which is the case at each comma that ends an entry.

You can rule out those empty matches by requiring each match to start with one or more characters that are not a colon or comma (a negated character class), followed by the colon:

[^:,\n]+:.*?(?=,(?:\w|$))

Explanation

[^:,\n]+ Match 1+ chars other than : , or a newline
: Match the colon
.*? Match any character except a newline, as few times as possible
(?= Positive lookahead, assert that what is directly to the right of the current position is:
- , Match literally
- (?:\w|$) Match either a single word char, or assert the end of the string
) Close the lookahead
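To see the pieces of the pattern at work, here is a minimal sketch in base R rather than stringr, assuming PCRE is used via perl = TRUE (needed for the lookahead). The names `entries`, `category` and `term` are illustrative and not part of the original answer; the split into category and term is an optional extra step.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Same pattern as above, written as an R string (backslashes doubled).
pattern <- "[^:,\\n]+:.*?(?=,(?:\\w|$))"

# gregexpr() locates every match; regmatches() extracts the matched text.
entries <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
print(entries)

# Every match must contain a colon, so each entry can be split
# into the part before the first colon and the part after it.
category <- sub(":.*$", "", entries)
term <- trimws(sub("^[^:]*:", "", entries))
print(category)
print(term)
```

Because `[^:,\n]+:` has to consume at least one character and a colon, the regex can no longer succeed at the bare comma positions where the original pattern returned empty strings.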

library(stringr)

string <- "active site: His, Glu,region: nucleotide-binding motif A,"
unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))

Output

[1] "active site: His, Glu"              "region: nucleotide-binding motif A"
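If you would rather keep the pattern from the question unchanged, a cruder alternative is to drop the empty matches after extraction. The sketch below assumes base R with perl = TRUE in place of str_extract_all(); `original_pattern`, `matches` and `entries` are illustrative names.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Pattern from the question; it can also succeed with an empty match
# at the commas that terminate each entry.
original_pattern <- ".*?(?=,(?:\\w+|$))"

# Extract all matches, including any empty ones.
matches <- regmatches(string, gregexpr(original_pattern, string, perl = TRUE))[[1]]

# nzchar() is TRUE only for non-empty strings, so this keeps the real entries.
entries <- matches[nzchar(matches)]
print(entries)
```

This treats the symptom rather than the cause, so the tightened pattern is usually the cleaner fix, but filtering with nzchar() is handy when the regex cannot easily be changed.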

沉鱼一梦 2025-01-30 14:03:56

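Much longer and not as elegant as the answer by @The fourth bird (+1), but it works:

library(stringr)

string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Split on the first "text,text," group; element 2 is the second entry
string2 <- strsplit(string, "([^,]+,[^,]+),", perl = TRUE)[[1]][2]
# Remove the second entry (matched literally) to leave the first entry
string1 <- str_replace(string, fixed(string2), "")
# Strip the trailing comma from both entries
string <- str_replace_all(c(string1, string2), '\\,$', '')
> string
[1] "active site: His, Glu"             
[2] "region: nucleotide-binding motif A"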
