Remove comments from text, except comment characters between quotes

Posted on 2024-12-29 06:23:27

I'm trying to build a regexp for removing comments from a configuration file. Comments are marked with the ; character. For example:

; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment

The difficulty I have is ignoring the comment character when it's placed between quotes.

Any ideas?

安人多梦 2025-01-05 06:23:27

You could try matching a semicolon only if it's followed by an even number of quotes:

;(?=(?:[^"]*"[^"]*")*[^"]*$).*

Be sure to use this regex with the Singleline option turned off and the Multiline option turned on.

In Python:

>>> import re
>>> t = """; This is a comment line
... keyword1 keyword2 ; comment
... keyword3 "key ; word 4" ; comment"""
>>> regex = re.compile(';(?=(?:[^"]*"[^"]*")*[^"]*$).*', re.MULTILINE)
>>> regex.sub("", t)
'\nkeyword1 keyword2 \nkeyword3 "key ; word 4" '
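
One case worth checking with this approach is a comment that itself contains an odd number of double quotes. The sketch below reuses the same pattern (the variable names and the second sample line are assumptions added for illustration) and suggests that such a comment is left in place, because the lookahead no longer sees a balanced set of quotes after the semicolon:

```python
import re

# Same pattern as in the answer above; only the names are new.
COMMENT = re.compile(r';(?=(?:[^"]*"[^"]*")*[^"]*$).*', re.MULTILINE)

balanced = 'keyword3 "key ; word 4" ; comment'
# Hypothetical line: the comment contains a single, unpaired double quote.
unbalanced = 'keyword1 keyword2 ; a comment with one " in it'

stripped_balanced = COMMENT.sub("", balanced)
stripped_unbalanced = COMMENT.sub("", unbalanced)

print(repr(stripped_balanced))    # comment removed, quoted ; kept
print(repr(stripped_unbalanced))  # line comes back unchanged
```

If comments in your files may contain stray quotes, this pattern alone is probably not enough.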

摇划花蜜的午后 2025-01-05 06:23:27

No regex :)

$ grep -E -v '^;' input.txt
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment

不醒的梦 2025-01-05 06:23:27

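You can use a regexp to pull all the quoted strings out first, replace them with some placeholder, then simply cut off everything matching ;.* and finally put the strings back :)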
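
A minimal sketch of that placeholder idea in Python, under a few assumptions that are not in the answer: strings use double quotes only and do not span lines, the input never contains a NUL character (used here as the placeholder delimiter), and the function name `strip_comments` is made up for illustration.

```python
import re

STRING = re.compile(r'"[^"\n]*"')           # a double-quoted string on one line
COMMENT = re.compile(r';.*')                # comment marker through end of line
PLACEHOLDER = re.compile(r'\x00(\d+)\x00')  # assumed never to occur in the input


def strip_comments(text):
    saved = []

    def stash(match):
        # Remember the quoted string and leave a numbered placeholder behind.
        saved.append(match.group(0))
        return f"\x00{len(saved) - 1}\x00"

    masked = STRING.sub(stash, text)      # 1. take the strings out
    masked = COMMENT.sub("", masked)      # 2. cut off the comments
    # 3. put back the strings that survived the cut
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], masked)


sample = (
    '; This is a comment line\n'
    'keyword1 keyword2 ; comment\n'
    'keyword3 "key ; word 4" ; comment'
)
print(repr(strip_comments(sample)))
```

Strings that sit inside a comment are stashed too, but their placeholders disappear together with the comment, so they are simply never restored.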

梦里南柯 2025-01-05 06:23:27

Something like this:

("[^"]*")*.*(;.*)

First, match any amount of text between quotes, then match a ;. If the ; is between quotes it will be matched by the first group, not by the second group.
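
Before relying on this pattern it may be worth checking what the second group actually captures. The sketch below runs it through Python's `re` module (the helper name `comment_group` and the second and third sample lines are assumptions for illustration). Because `.*` is greedy, the second group seems to start at the last semicolon on the line, whether or not that semicolon is inside quotes:

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = re.compile(r'("[^"]*")*.*(;.*)')


def comment_group(line):
    """Return what the second group captures, or None if nothing matches."""
    match = PATTERN.search(line)
    return match.group(2) if match else None


with_comment = comment_group('keyword3 "key ; word 4" ; comment')
quoted_only = comment_group('keyword3 "key ; word 4"')
two_semicolons = comment_group('keyword1 ; first ; second')

print(repr(with_comment))    # the trailing comment
print(repr(quoted_only))     # starts at the ; inside the quotes
print(repr(two_semicolons))  # starts at the last ;, not the first
```

So the pattern handles the example from the question, but a line with a quoted semicolon and no comment, or a comment containing another semicolon, would need extra care.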

唐婉 2025-01-05 06:23:27

I (somewhat accidentally) came up with a working regex:

replace(/^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm, '$1')

I wanted:

remove single line comments at start of line or end of line,
to use single and double quotes,
the ability to have just one quote in a comment: that's useful (but accept " as well)
(so matching on a balanced set (even number) of quotes after a comment-delimiter as in Tim Pietzcker's answer was not suitable),
leave comment-delimiter ; alone in correctly (closed) quoted 'strings'
mix quoting style
multiple quoted strings (and comments in/after comments)
nest single/double quotes in resp. double/single quoted 'strings'
data to work on is like valid ini-files (or assembly), as long as it doesn't contain escaped quotes or regex-literals etc.
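
For comparison, the requirements above can also be sketched without a regex, as a small per-line scanner in Python. This is a swapped-in technique, not the regex from this answer, and the function names are made up for illustration. It assumes there are no escaped quotes and that a quoted string never spans lines; a line with an unclosed quote before the first ; is left untouched.

```python
def strip_comment(line):
    """Cut a line at the first ; that is not inside '...' or "..."."""
    quote = None  # the quote character we are currently inside, if any
    for index, char in enumerate(line):
        if quote is not None:
            if char == quote:       # closing quote of the same kind
                quote = None
        elif char in "'\"":         # opening quote, either style
            quote = char
        elif char == ";":           # comment marker outside any string
            return line[:index].rstrip(" \t")
    return line


def strip_comments(text):
    return "\n".join(strip_comment(line) for line in text.split("\n"))


one_quote = strip_comment(
    "keyword3 'key ; word 4' ; comment and one quote ' ;see it?"
)
nested = strip_comment("db 'A;B',\"hi;hi\",0xfe,\"'as;df'\" ;'haha\"")
print(one_quote)
print(nested)
```

Since scanning stops at the first unquoted ;, any quotes inside the comment itself never need to be balanced.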

Lacking lookbehind in JavaScript, I thought it might be an idea not to match the comments (and replace them with ''), but to match the data preceding the comment and then replace the full match with the sub-match data.
One could envision this concept on a line-by-line basis (replace the full line with the match, thereby 'losing' the comment), BUT the multiline flag doesn't seem to work exactly that way (at least in the browser).

[^'";]* starts eating any characters from the 'start' that are not '";.
(Completely counter-intuitive (to me): [^'";\r\n]* will not work.)

(?:'[^']*'|"[^"]*")? is a non-capturing group matching zero or one quoted string: an opening quote, any characters, and the matching closing quote.
(Back-reference variants such as (?:(['"])[^\2]*\2)? in /^((?:[^'";]*(?:(['"])[^\2]*\2)?)*)[ \t]*;.*$/gm, or (?:(['"])[^\2\r\n]*\2)? in /^((?:[^'";]*(?:(['"])[^\2\r\n]*\2)?)*)[ \t]*;.*$/gm (although mysteriously better), do not work: they broke on db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as. Not adding another capturing group for re-use in the match is a good thing anyway, as they come with penalties.)

The above combo is placed in a non-capturing group which may repeat zero or more times, and its result is placed in capturing group 1 to pass along.

That leaves us with [ \t]*;.* which 'simply' matches zero or more spaces and tabs, followed by a semicolon, followed by zero or more characters that are not a newline. Note how ; is NOT optional!

To get a better idea of how this (multi-line parameter) works, hit the exp button in the demo below.

function demo(){
  var elms=document.getElementsByTagName('textarea');
  var str=elms[0].value;
  elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
                           , '$1'
                           )
                   .replace( /[ \t]*$/gm, ''); //optional trim
}


function demo_exp(){
  var elms=document.getElementsByTagName('textarea');
  var str=elms[0].value;
  elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
                           , '**S**$1**E**'  //to see start and end of match.
                           );
}

<textarea  style="width:98%;height:150px" onscroll="this.nextSibling.scrollTop=this.scrollTop;">
; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment

  
"Text; in" and between "quotes; plus" semicolons; this is the comment
  
  ; This is a comment line
  keyword1 keyword2 ; comment
  keyword3 'key ; word 4' ; comment and one quote ' ;see it?
  
_b64decode:
        db    0x83,0xc6,0x3A ; add   si, b64decode_end - _b64decode ;39
        push  'a'   
        pop   di 
  
        cmp   byte [si], 0x2B ; '+'


b64decode_end:
        ;append base64 data here
        ;terminate with printable character less than '+'
        db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as;df'" ;'haha"
;"end'
  
</textarea><textarea style="width:98%;height:150px" onscroll="this.previousSibling.scrollTop=this.scrollTop;">
result here
</textarea>
<br><button onclick="demo()">remove comments</button><button onclick="demo_exp()">exp</button>

Hope this helps.

PS: Please comment with valid examples if and where this might break! Since I generally agree (from extensive personal experience) that it is impossible to reliably remove comments using regex (especially in higher-level programming languages), my gut still says this can't be fool-proof. However, I've been throwing existing data and crafted 'what-ifs' at it for over 2 hours and couldn't get it to break (which I'm usually very good at).
