Splitting a string on a delimiter in Python while ignoring delimiters inside quotes and escaped quotes

Posted on 2024-10-19 16:26:52

I am trying to split a string based on the location of a delimiter (I am trying to remove comments from Fortran code). I can split using ! in the following string:

x = '''print "hi!" ! Remove me'''
pattern = '''(?:[^!"]|"[^"]*")+'''
y = re.search(pattern, x)

However, this fails if the string contains escape quotes, e.g.

z = '''print "h\"i!" ! Remove me'''

Can the regex be modified to handle escape quotes? Or should I not even be using regexps for this sort of problem?

Comments (3)

黑白记忆 2024-10-26 16:26:52

Here's a proven regex (from Mastering Regular Expressions) for matching double-quoted string literals which may contain backslash-escaped quotes:

r'"[^"\\]*(?:\\.[^"\\]*)*"'

Within the delimiting quotes, it consumes any pair of characters that starts with a backslash without bothering to identify the second character; that allows it to handle escaped backslashes and other escape sequences with no extra hassle. It's also as efficient as can be in the absence of possessive quantifiers and atomic groups, which Python's re module did not support before version 3.11.
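To make the behaviour concrete, here is a minimal sketch that runs the string-literal pattern on a few lines. The sample lines and the name `STRING_LITERAL` are illustrative, not from the answer; raw strings are used so the backslashes actually reach the regex engine.

```python
import re

# The string-literal pattern quoted above, compiled once.
STRING_LITERAL = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

# Illustrative inputs (assumed, not from the original post).
samples = [
    r'print "hi!" ! Remove me',
    r'print "h\"i!" ! Remove me',
    r'print "ends with backslash\\" ! Remove me',
]

for line in samples:
    match = STRING_LITERAL.search(line)
    print(match.group(0))
# prints:
# "hi!"
# "h\"i!"
# "ends with backslash\\"
```

In the last sample the two backslashes are consumed as one escaped pair, so the quote that follows is correctly treated as the closing quote.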

The full regex for your application would be:

r'^((?:[^!"]+|"[^"\\]*(?:\\.[^"\\]*)*")*)!.*$'

This matches only lines that contain comments, and captures everything preceding the comment in group #1. The capture may be zero-length, for lines that start with !. This regex is intended for use with sub rather than search, as shown here:

import re

pattern = r'^((?:[^!"]+|"[^"\\]*(?:\\.[^"\\]*)*")*)!.*$'
x = '''print "hi!" ! Remove me'''
y = re.sub(pattern, r'\1', x)
print(y)

See it in action on ideone.com

DISCLAIMER: This answer is not about FORTRAN, only about code that follows the rules specified in the question. I've never worked with FORTRAN, and every reference I've found in the last hour or so seems to describe a completely different language. Meh!
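A sketch of how the pattern might be applied line by line to a block of source. The helper name `strip_comment`, the trailing `rstrip()` and the sample lines are illustrative, not from the answer. One deliberate change, labeled as an assumption: the inner `+` after `[^!"]` is dropped, because a quantified class inside a `*` group can backtrack heavily on long lines that contain no comment. The set of strings matched is the same.

```python
import re

# Variant of the answer's pattern: [^!"] instead of [^!"]+ (assumption, see above).
COMMENT = re.compile(r'^((?:[^!"]|"[^"\\]*(?:\\.[^"\\]*)*")*)!.*$')

def strip_comment(line):
    """Return `line` without its trailing ! comment (hypothetical helper)."""
    # Lines without a comment do not match, so sub() returns them unchanged.
    return COMMENT.sub(r'\1', line).rstrip()

# Illustrative input; a raw string keeps the backslash in the second line.
source = r'''x = 1 ! set x
print "h\"i!" ! Remove me
print "no comment here!"
! whole-line comment'''

cleaned = [strip_comment(line) for line in source.splitlines()]
for line in cleaned:
    print(repr(line))
```

A whole-line comment comes back as an empty string, so a caller that wants to drop such lines entirely would need to filter them out.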

泪冰清 2024-10-26 16:26:52

Fortran parsing is actually quite tricky (see e.g. a thread here). I am blissfully unfamiliar with the details of the syntax, and where '!' might occur. So here is a thought: how likely is it that the comments themselves include '!' ? If it is not very likely, you might simply remove everything after the last '!' in each line:

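def cleanup(line):
    # Split on every '!'; the last piece is the candidate comment.
    splitlist = line.split("!")
    # Only cut when that last piece contains no double quote, i.e. the
    # final '!' is presumably not inside a string literal.
    if len(splitlist) > 1 and "\"" not in splitlist[-1]:
        return '!'.join(splitlist[:-1]).strip()
    else:
        return line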

This is not perfect, but at worst, you will end up leaving some partial comments. This should never affect actual code.
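To see what the heuristic does in practice, here is a small sketch that runs the same function on a few made-up lines (the inputs are assumptions, not from the answer). The last case suggests one caveat: the check only looks for double quotes, so a `!` inside a single-quoted string appears to get cut.

```python
def cleanup(line):
    # Same heuristic as above: cut at the last '!' unless the tail contains a double quote.
    splitlist = line.split("!")
    if len(splitlist) > 1 and '"' not in splitlist[-1]:
        return "!".join(splitlist[:-1]).strip()
    return line

examples = [
    'print "hi!" ! Remove me',   # comment removed
    'x = 1 ! first ! second',    # only the part after the last '!' is removed
    'print "hi!"',               # '!' inside a string: tail has a quote, line kept
    'y = 2 ! say "hello"',       # quote inside the comment: whole line kept
    "print 'a!b'",               # single-quoted string: code is truncated
]

for line in examples:
    print(repr(cleanup(line)))
# prints:
# 'print "hi!"'
# 'x = 1 ! first'
# 'print "hi!"'
# 'y = 2 ! say "hello"'
# "print 'a"
```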

Edit:

Looks like NumPy includes a python-based Fortran parser in the F2py package. Depending on licensing constraints, you may be able to rework that to reliably parse 'code but not comments.'

憧憬巴黎街头的黎明 2024-10-26 16:26:52

What you need is a negative lookbehind assertion: (?<!...).

For example:

z = r'''print "h\"i!" ! Remove me'''
pattern = r'''(?:[^!"]|(?<!\\)".*(?<!\\)")+'''
y = re.search(pattern, z)

print(y.group(0))

Output:

print "h\"i!" 

As pointed out in the comments, the expression above will not handle escaped backslashes. Also it will not handle single quotes which are allowed in FORTRAN. This one should work for those cases as well (I think):

pattern = r'''(?:[^!"']|((?<!\\)"|(\\\\)+").*?((?<!\\)"|(\\\\)+")|((?<!\\)'|(\\\\)+').*?((?<!\\)'|(\\\\)+'))+'''

This is getting a little ugly . . .
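Since the question also asks whether a regex is the right tool at all, here is a sketch of a non-regex alternative: a small character scanner. It is not from this thread. It assumes the rules stated in the question (backslash escapes inside strings) plus single-quoted strings, and the function name and parameters are hypothetical.

```python
def strip_comment_scan(line, comment_char="!", escape_char="\\"):
    """Return `line` up to the first comment_char that is outside any quoted string."""
    quote = None  # the quote character that opened the current string, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == escape_char:
                i += 1        # skip the escaped character, whatever it is
            elif ch == quote:
                quote = None  # closing quote
        elif ch in "\"'":
            quote = ch        # opening quote
        elif ch == comment_char:
            return line[:i].rstrip()
        i += 1
    return line  # no comment found

print(strip_comment_scan(r'print "h\"i!" ! Remove me'))  # print "h\"i!"
print(strip_comment_scan(r'print "a\\" ! c'))            # print "a\\"
print(strip_comment_scan("print 'it!' ! c"))             # print 'it!'
```

The scanner makes a single pass over the line and keeps the quoting state in one variable, which may be easier to extend than the combined pattern if more rules turn up.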
