Regular expression to eliminate fields in a BibTeX file

Posted 2024-09-16 03:47:23


I am trying to slim down the bib text files I get from my reference manager, because it leaves extra fields that end up getting mangled when I put the files into LaTeX.

A characteristic entry that I want to clean up is:

@Article{Kholmurodov:2001p113,
author = {K Kholmurodov and I Puzynin and W Smith and K Yasuoka and T Ebisuzaki}, 
journal = {Computer Physics Communications},
title = {MD simulation of cluster-surface impacts for metallic phases: soft landing, droplet spreading and implantation},
abstract = {Lots of text here.  Even more text.},
affiliation = {RIKEN, Inst Phys {\&} Chem Res, Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},
number = {1},
pages = {1--16},
volume = {141},
year = {2001},
month = {Dec},
language = {English},
keywords = {Ethane, molecular dynamics, Clusters, Dl_Poly Code, solid surface, metal, Hydrocarbon Thin-Films, Adsorption, impact, Impact Processes, solid surface, Molecular Dynamics Simulation, Large Systems, DL_POLY, Beam Deposition, Package, Collision-Induced Desorption, Diamond Films, Vapor-Deposition, Transition-Metals, Molecular-Dynamics Simulation}, 
date-added = {2008-06-27 08:58:25 -0500},
date-modified = {2009-03-24 15:40:27 -0500},
pmid = {000172275000001},
local-url = {file://localhost/User/user/Papers/2001/Kholmurodov/Kholmurodov-MD%20simulation%20of%20cluster-surface%20impacts-2001.pdf},
uri = {papers://B08E511A-2FA9-45A0-8612-FA821DF82090/Paper/p113},
read = {Yes},
rating = {0}
}

I would like to eliminate fields like month, abstract, keywords, etc., some of which are single lines and some of which span multiple lines.

I have given it a try in Python, like this:

fOpen = open(f,'r')
start_text = fOpen.read()
fOpen.close()

# regex
out_text = re.sub(r'^(month).*,\n','',start_text)
out_text = re.sub(r'^(annote)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(note)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(abstract)((.|\n)*?)\},\n','',out_text)

fNew = open(f,'w')
fNew.write(out_text)
fNew.close()

I tried these regexes in TextMate before running them in Python, and they appear to work there.

Any suggestions?

Thanks.

菊凝晚露 2024-09-23 03:47:23

What about this regex (apply with multi-line and dotall flags):

^(?:month|annote|note|abstract)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+

Explanation:

^                             # start-of-line
(?:                           # non-capturing group 1
  month|annote|note|abstract  #   one of these terms
)                             # end non-capturing group 1
\s*=\s*                       # whitespace, an equals sign, whitespace
\{                            # a literal curly brace
(?:                           # non-capturing group 2
  (?!                         #   negative look-ahead (if not followed by...)
    \},$                      #     a curly brace, a comma and the end-of-line
  )                           #   end negative look-ahead
  .                           #   ...then match next character, whatever it is
)*                            # end non-capturing group 2, repeat
\},                           # a literal curly brace and a comma
[\r\n]+                       # at least one end-of-line character

This single expression sorts out all affected lines in one step.
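For reference, here is a minimal sketch of how the expression above might be applied from Python. The names `FIELD_RE`, `strip_fields` and `sample` are made up for illustration, and the sample entry is a shortened version of the one in the question with the abstract split over two lines. It assumes `\n` line endings and no trailing whitespace after the closing `},` of the fields being removed.

```python
import re

# Pattern from the answer. MULTILINE makes ^ and $ match at line boundaries;
# DOTALL lets . cross newlines so multi-line field values are consumed.
FIELD_RE = re.compile(
    r'^(?:month|annote|note|abstract)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+',
    re.MULTILINE | re.DOTALL,
)

def strip_fields(bibtex_text):
    """Return bibtex_text with the matched fields removed."""
    return FIELD_RE.sub('', bibtex_text)

# Shortened, hypothetical sample based on the entry in the question.
sample = (
    "@Article{Kholmurodov:2001p113,\n"
    "journal = {Computer Physics Communications},\n"
    "abstract = {Lots of text here.\n"
    "Even more text.},\n"
    "number = {1},\n"
    "month = {Dec},\n"
    "language = {English},\n"
    "rating = {0}\n"
    "}\n"
)

cleaned = strip_fields(sample)
print(cleaned)
```

Note that without `re.MULTILINE`, `^` only matches at the very start of the whole string, so a pattern anchored with `^` never reaches a field in the middle of the file. The calls in the question pass no flags at all.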

EDIT / WARNING: Note that this will fail as soon as a field value contains a nested `},` at the end of a line, before the field's real closing brace, as in the following:

affiliation = {RIKEN, Inst Phys {\&},
Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},

Nested structures cannot be handled by regular expressions. No pure regex solution can be correct in all cases in this context; the best you can get is a good approximation.

The question is whether you are 100% sure that the situation above cannot occur (and I don't think you can be), or whether you are willing to take the risk. If you are not entirely sure that this will not be a problem, use or write a parser.
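If you go the "write a parser" route, a rough sketch could look like the following. It is not a full BibTeX parser. It uses a regex only to find where a field starts, then counts braces by hand to find where the value ends. The function name `remove_fields` and the sample text are invented for this example. It assumes values are delimited by braces (not quotes or bare macros), that braces inside values are balanced, and that each field starts on its own line.

```python
import re

def remove_fields(bibtex_text, names):
    """Remove `name = {...}` fields, finding the closing brace by depth counting."""
    # The regex only locates the start of a field; the value is scanned by hand.
    start_re = re.compile(
        r'^[ \t]*(?:' + '|'.join(map(re.escape, names)) + r')\s*=\s*\{',
        re.MULTILINE | re.IGNORECASE,
    )
    # After the closing brace: optional comma, spaces, and one line break.
    tail_re = re.compile(r'[ \t]*,?[ \t]*(?:\r\n|\r|\n)?')

    out = []
    pos = 0
    while True:
        m = start_re.search(bibtex_text, pos)
        if m is None:
            out.append(bibtex_text[pos:])
            break
        # Walk forward from the opening brace until the depth returns to zero.
        depth = 1
        i = m.end()
        while i < len(bibtex_text) and depth > 0:
            ch = bibtex_text[i]
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
            i += 1
        if depth != 0:
            # Unbalanced braces: leave the rest untouched rather than guess.
            out.append(bibtex_text[pos:])
            break
        out.append(bibtex_text[pos:m.start()])
        pos = tail_re.match(bibtex_text, i).end()
    return ''.join(out)

# Hypothetical sample with a nested `},` at the end of a line.
sample = (
    "@Article{key,\n"
    "title = {A title},\n"
    "abstract = {Inst Phys {\\&},\n"
    "Computat Sci Div},\n"
    "month = {Dec},\n"
    "year = {2001}\n"
    "}\n"
)

result = remove_fields(sample, ['month', 'abstract'])
print(result)
```

Because the end of a value is found by brace depth rather than by looking for `},` at the end of a line, the nested case from the warning above is handled. Anything outside the stated assumptions (quoted values, several fields on one line) would still need a real BibTeX parser.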
