Regex search to extract a BibTeX title string in R

Posted on 2025-02-11 18:15:46

I have a data frame in R where one column, named Title, is a BibTeX entry that looks like this:

={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  
author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  
journal={Journal of the ACM (JACM)},\n  
volume={38},\n  
number={3},\n  
pages={690--728},\n  
year={1991},\n  
publisher={ACM New York, NY, USA}\n}

I need to extract only the title of the BibTeX citation, which is the string after ={ and before the next }.

In this example, the output should be:

Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems

I need to do this for all rows in the data frame. Not all rows have the same number of BibTeX fields, so the regex has to ignore everything after the first }

I'm currently trying sub(".*\\={\\}\\s*(.+?)\\s*\\|.*$", "\\1", data$Title), but it fails with the TRE pattern compilation error 'Invalid contents of {}'.

How should I do this?

Comments (2)

蔚蓝源自深海 2025-02-18 18:15:46

A possible solution, using stringr::str_extract and lookaround:

library(stringr)

str_extract(s, "(?<=\\{)[^}]+(?=\\})")

#> [1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"
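
The same lookaround pattern can also be tried in base R without stringr. The following is only a sketch: the answer does not define `s`, so the sample vector below is an assumption, and `titles` is a name made up for illustration. Note that `regmatches()` with `regexpr()` silently drops elements that have no match, so the result can be shorter than the input.

```r
# `s` is not defined in the answer; this sample value is an assumption.
s <- c(
  "={Proofs that yield nothing but their validity},\n  author={Goldreich, Oded}\n}",
  "={A short entry}\n}"
)

# perl = TRUE is required for lookbehind/lookahead in base R.
# regexpr() locates the first match in each element of `s`.
m <- regexpr("(?<=\\{)[^}]+(?=\\})", s, perl = TRUE)

# regmatches() returns the matched substrings (non-matching elements are dropped).
titles <- regmatches(s, m)
print(titles)
```

If every row must keep its position (for example, to assign the result back into a data frame column), `str_extract` is the safer choice because it returns NA for rows without a match.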

寂寞美少年 2025-02-18 18:15:46

Note that { is a special regex metacharacter, so it needs to be escaped.

To match any string between the curly braces, you need a negated character class (negated bracket expression) based pattern like \{([^{}]*)}.

You can use

sub(".*?=\\{([^{}]*)}.*", "\\1", df$Title)

See the regex demo and the R demo:

Title <- c("={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  journal={Journal of the ACM (JACM)},\n  volume={38},\n  number={3},\n  pages={690--728},\n  year={1991},\n  publisher={ACM New York, NY, USA}\n}")
sub(".*?=\\{([^{}]*)}.*", "\\1", Title)

Output:

[1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"

Pattern details:

  • .*? - any zero or more chars, as few as possible
  • =\\{ - a ={ substring
  • ([^{}]*) - Group 1 (\1): any zero or more chars other than curly braces
  • } - a } char (it is not special, no need to escape)
  • .* - the rest of the string.
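
Since the question asks for this on every row of a data frame, here is a minimal sketch of the same `sub()` call applied to a whole column. The data frame `df` and the new column name `clean_title` are hypothetical; the second row deliberately has fewer BibTeX fields than the first.

```r
# Hypothetical data frame; the second row has fewer BibTeX fields than the first.
df <- data.frame(
  Title = c(
    "={Proofs that yield nothing but their validity},\n  author={Goldreich, Oded},\n  year={1991}\n}",
    "={A short entry}\n}"
  ),
  stringsAsFactors = FALSE
)

# sub() is vectorised, so the pattern is applied to each row of the column.
# Group 1 captures the text between "={" and the first "}".
df$clean_title <- sub(".*?=\\{([^{}]*)}.*", "\\1", df$Title)
print(df$clean_title)
```

One caveat: when a row does not match the pattern at all, `sub()` returns that row unchanged rather than NA, so it may be worth checking for non-matching rows with `grepl()` first.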