Splitting a comma- and semicolon-separated string in R
I'm trying to split a string containing two entries, where each entry has a specific format:
- Category (e.g. active site / region), which is followed by a :
- Term (e.g. His, Glu / nucleotide-binding motif A), which is followed by a ,
Here's the string that I want to split:
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
This is what I have tried so far. Except for the two empty substrings, it produces the desired output.
library(stringr)
unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))
[1] "active site: His, Glu" "" "region: nucleotide-binding motif A"
[4] ""
How do I get rid of the empty substrings?
Comments (2)
You get the empty strings because .*? can also match an empty string at any position where the assertion (?=,(?:\\w+|$)) is true.
You can exclude matching a colon or comma by using a negated character class before matching the : character.
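Assembled from the parts listed in the explanation, the pattern would be [^:,\n]+:.*?(?=,(?:\w|$)). The following is a minimal sketch of how it could be applied, not necessarily the exact code originally posted. It uses base R (regmatches and gregexpr with perl = TRUE) so that it runs without extra packages; the names pattern and entries are illustrative only. The same pattern should also work with str_extract_all from the question.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Negated class first, so a match can never start at a comma
# and therefore can never be empty.
pattern <- "[^:,\\n]+:.*?(?=,(?:\\w|$))"

# perl = TRUE is needed because the pattern uses a lookahead.
entries <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
print(entries)
```

Because the match must begin with at least one character that is not a colon, comma, or newline, the regex engine skips the positions where the original pattern returned an empty string.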
Explanation
- [^:,\n]+ Match 1+ chars other than : , or a newline
- : Match the colon
- .*? Match any char, as few as possible
- (?= Positive lookahead, assert that what is directly to the right of the current position is:
- , Match literally
- (?:\w|$) Match either a single word char, or assert the end of the string
- ) Close the lookahead
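A blunter alternative to changing the pattern is to keep the original regex and drop the empty strings afterwards. This is only a sketch: the vector res below is typed in by hand from the output shown in the question, and the names res and cleaned are illustrative.

```r
# Result vector as printed in the question, including the two empty strings
res <- c("active site: His, Glu", "", "region: nucleotide-binding motif A", "")

# nzchar() is TRUE for non-empty strings, so indexing with it
# keeps only the real entries.
cleaned <- res[nzchar(res)]
print(cleaned)
```

Filtering hides the empty matches rather than preventing them, so the revised pattern is arguably the cleaner fix when the regex can be changed.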
Much longer and not as elegant as the answer from @The fourth bird (+1), but it works: