Split a string by a delimiter in Python while ignoring delimiters inside quotes and escaped quotes
I am trying to split a string based on the location of a delimiter (I am trying to remove comments from Fortran code). I can split using '!' in the following string:
import re
x = '''print "hi!" ! Remove me'''
pattern = '''(?:[^!"]|"[^"]*")+'''
y = re.search(pattern, x)
However, this fails if the string contains escape quotes, e.g.
z = r'''print "h\"i!" ! Remove me'''
Can the regex be modified to handle escape quotes? Or should I not even be using regexps for this sort of problem?
Comments (3)
Here's a proven regex (from Mastering Regular Expressions) for matching double-quoted string literals which may contain backslash-escaped quotes:
Within the delimiting quotes, it consumes any pair of characters that starts with a backslash without bothering to identify the second character; that allows it to handle escaped backslashes and other escape sequences with no extra hassle. It's also as efficient as can be in the absence of possessive quantifiers and atomic groups, which Python's re module did not support before version 3.11.
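The pattern described above is commonly written in the "unrolled loop" form below. This is a minimal sketch, assuming Python's re module; the names DQ_STRING and sample are illustrative only, and the sample uses a raw string so that the backslash actually reaches the regex engine.

```python
import re

# Unrolled-loop pattern for a double-quoted literal with backslash escapes:
#   "                  opening quote
#   [^"\\]*            a run of ordinary characters
#   (?:\\.[^"\\]*)*    a backslash pair, then more ordinary characters, repeated
#   "                  closing quote
DQ_STRING = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

# Raw string: the backslash before the inner quote is kept as-is.
sample = r'print "h\"i!" ! Remove me'

dq_match = DQ_STRING.search(sample)
print(dq_match.group(0))
```

Because a backslash always consumes the character after it, an escaped backslash immediately before the closing quote is not mistaken for an escaped quote.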
The full regex for your application would be:
See it in action on ideone.com
DISCLAIMER: This answer is not about FORTRAN, only about code that follows the rules specified in the question. I've never worked with FORTRAN, and every reference I've found in the last hour or so seems to describe a completely different language. Meh!
This matches only lines that contain comments, and captures everything preceding the comment in group #1. The capture may be zero-length, for lines that start with '!'. This regex is intended for use with sub rather than search (see it in action on ideone.com).
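A sketch of how such a substitution might look follows. The exact pattern from the answer is not reproduced here; COMMENT_RE and strip_comments are hypothetical names, and the pattern is an assumption built from the string-literal regex described earlier. Newlines are excluded from the character classes so that each match stays on one line.

```python
import re

# Group 1: everything before the first '!' that is outside a string literal.
# The '!' and the rest of the line are matched but not captured.
COMMENT_RE = re.compile(
    r'^((?:[^!"\n]|"[^"\\\n]*(?:\\.[^"\\\n]*)*")*)!.*$',
    re.MULTILINE,
)

def strip_comments(source: str) -> str:
    """Replace each commented line with the code that precedes the comment."""
    return COMMENT_RE.sub(r'\1', source)

source = '\n'.join([
    r'print "hi!" ! Remove me',
    r'print "h\"i!" ! Remove me',
    r'! whole-line comment',
    r'x = 1',
])

stripped = strip_comments(source)
print(stripped)
```

Lines without a comment do not match at all and are left untouched; trailing whitespace before a removed comment is kept, so a call to rstrip per line may be wanted afterwards.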
Fortran parsing is actually quite tricky (see e.g. a thread here). I am blissfully unfamiliar with the details of the syntax, and where '!' might occur. So here is a thought: how likely is it that the comments themselves include '!' ? If it is not very likely, you might simply remove everything after the last '!' in each line:
This is not perfect, but at worst, you will end up leaving some partial comments. This should never affect actual code.
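The "last '!'" idea can be sketched without any regex. This is only an illustration under the answer's assumption that comments rarely contain '!'; strip_after_last_bang is a hypothetical helper name.

```python
def strip_after_last_bang(line: str) -> str:
    """Drop the last '!' on the line and everything after it."""
    head, sep, _tail = line.rpartition('!')
    # rpartition returns an empty separator when '!' is absent,
    # in which case the line is returned unchanged.
    return head if sep else line

print(strip_after_last_bang('print "hi!" ! Remove me'))
print(strip_after_last_bang('y = 2 ! note! more'))
print(strip_after_last_bang('x = 1'))
```

The second call shows the partial-comment case: only the text after the final '!' is removed. Note that a line with no comment but with a '!' inside a string literal would be truncated inside the literal, so this sketch is only safe where that cannot occur.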
Edit:
It looks like NumPy includes a Python-based Fortran parser in the F2py package. Depending on licensing constraints, you may be able to rework that to reliably parse 'code but not comments.'
What you need is a negative lookbehind assertion: (?<!...). For example:
Output:
As pointed out in the comments, the expression above will not handle escaped backslashes. Also it will not handle single quotes which are allowed in FORTRAN. This one should work for those cases as well (I think):
This is getting a little ugly . . .
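The two variants discussed in this answer might look roughly like the sketch below. These patterns are assumptions written for illustration, not the answer's original expressions; LOOKBEHIND and BOTH_QUOTES are hypothetical names.

```python
import re

# Variant 1: a closing double quote only counts when it is not
# preceded by a backslash (negative lookbehind).
LOOKBEHIND = re.compile(r'(?:[^!"]|".*?(?<!\\)")+')

# Variant 2: consume backslash pairs explicitly, and accept
# single-quoted as well as double-quoted literals.
BOTH_QUOTES = re.compile(
    r'''(?:[^!"']|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')+'''
)

z = r'print "h\"i!" ! Remove me'   # escaped quote inside the literal
w = r'print "a\\" ! Remove me'     # escaped backslash before the closing quote
s = "print 'hi!' ! Remove me"      # single-quoted literal

code_z = LOOKBEHIND.match(z).group(0)
code_w_lookbehind = LOOKBEHIND.match(w).group(0)
code_w = BOTH_QUOTES.match(w).group(0)
code_s = BOTH_QUOTES.match(s).group(0)

print(repr(code_z))
print(repr(code_w_lookbehind))
print(repr(code_w))
print(repr(code_s))
```

For w, the lookbehind variant stops right after "print " because it treats the real closing quote as escaped, which is the limitation with escaped backslashes mentioned above; the second variant keeps the whole literal.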