Regex search to extract the BibTeX title string in R
I have a data frame in R where one column, named Title, is a BibTeX entry that looks like this:
={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n
author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n
journal={Journal of the ACM (JACM)},\n
volume={38},\n
number={3},\n
pages={690--728},\n
year={1991},\n
publisher={ACM New York, NY, USA}\n}
I need to extract only the title of the BibTeX citation, which is the string after ={ and before the next }.
In this example, the output should be:
Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems
I need to do this for all rows in the data frame. Not all rows have the same number of BibTeX fields, so the regex has to ignore everything after the first }.
I'm currently trying sub(".*\\={\\}\\s*(.+?)\\s*\\|.*$", "\\1", data$Title) and am met with TRE pattern compilation error 'Invalid contents of {}'.
How should I do this?
Comments (2)
A possible solution, using stringr::str_extract and lookaround:
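A minimal sketch of what such a call might look like, not necessarily the answerer's exact code. It assumes the stringr package is installed, and the `df` data frame below is a hypothetical stand-in for the real data. The lookbehind requires `={` immediately before the match, and the lookahead requires `}` immediately after it, so only the text between them is returned.

```r
library(stringr)

# Hypothetical sample data; the real column is data$Title in the question.
df <- data.frame(
  Title = c(
    "={Proofs that yield nothing but their validity},\n  author={Goldreich, Oded},\n  year={1991}\n}",
    "={A shorter hypothetical entry},\n  year={2000}\n}"
  ),
  stringsAsFactors = FALSE
)

# (?<==\\{)  lookbehind: the match must be preceded by "={"
# [^{}]*     any run of characters that are not curly braces
# (?=\\})    lookahead: the match must be followed by "}"
titles <- str_extract(df$Title, "(?<==\\{)[^{}]*(?=\\})")
print(titles)
```

str_extract returns only the first match per string, so later fields such as author or year are ignored, and rows with no match yield NA.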
Mind that the { char is a special regex metacharacter, so it needs to be escaped. To match any string between curly braces, you need a pattern based on a negated character class (negated bracket expression), such as \{([^{}]*)}.
You can use sub(".*?=\\{([^{}]*)}.*", "\\1", data$Title).
Pattern details:
.*? - any zero or more chars, as few as possible
=\\{ - a ={ substring
([^{}]*) - Group 1 (\1): any zero or more chars other than curly braces
} - a } char (it is not special, no need to escape)
.* - the rest of the string.
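The pattern pieces listed above can be assembled into a single base R sub call. The following is a small sketch under stated assumptions: the data frame and its rows are hypothetical sample data, and the ExtractedTitle column name is made up for illustration. It relies on R's default (TRE) regex engine, where . also matches line breaks, so the trailing .* consumes the remaining fields.

```r
# Hypothetical sample data; the second row has fewer BibTeX fields than the first.
data <- data.frame(
  Title = c(
    "={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  year={1991}\n}",
    "={A shorter hypothetical entry},\n  year={2000}\n}"
  ),
  stringsAsFactors = FALSE
)

# The opening brace is escaped; Group 1 captures everything up to the first "}".
# The whole string is matched and replaced by Group 1 alone.
data$ExtractedTitle <- sub(".*?=\\{([^{}]*)}.*", "\\1", data$Title)
print(data$ExtractedTitle)
```

Note that sub returns the input unchanged when the pattern does not match, so rows without a ={...} field keep their original text rather than becoming NA.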