How can I match a comment unless it's inside a quoted string?

Posted 2024-08-21 18:29:21 · 320 characters · 9 views · 0 comments

So I have some string:

//Blah blah blach
// sdfkjlasdf
"Another //thing"

And I'm using java regex to replace all the lines that have double slashes like so:

theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");

And it works for the most part, but the problem is it removes all the occurrences and I need to find a way to have it not remove the quoted occurrence. How would I go about doing that?
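To make the failure concrete, here is a minimal sketch that runs the same pattern on the three sample lines. The class name `NaiveStripDemo` and the method `naiveStrip` are made up for illustration; only the regex comes from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself is taken from the question.
final class NaiveStripDemo {

    // Applies the question's pattern unchanged.
    static String naiveStrip(String s) {
        return Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input =
            "//Blah blah blach\n" +
            "// sdfkjlasdf\n" +
            "\"Another //thing\"\n";
        // The two real comments disappear, but so does everything from the
        // "//" inside the quoted string up to and including the line break.
        System.out.println("[" + naiveStrip(input) + "]");
    }
}
```

The pattern has no notion of being inside a string, so the quoted `//thing"` is treated exactly like a comment.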

赴月观长安 2024-08-28 18:29:21

Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only those parts you're interested in, you could use some 3rd party tool like ANTLR.

ANTLR has the ability to define only those tokens you are interested in (and of course the tokens that can mess up your token-stream like multi-line comments and String- and char literals). So you only need to define a lexer (another word for tokenizer) that correctly handles those tokens.

This is called a grammar. In ANTLR, such a grammar could look like this:

lexer grammar FuzzyJavaLexer;

options{filter=true;}

SingleLineComment
  :  '//' ~( '\r' | '\n' )*
  ;

MultiLineComment
  :  '/*' .* '*/'
  ;

StringLiteral
  :  '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
  ;

CharLiteral
  :  '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
  ;

Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 (the antlr-3.2.jar used in the commands below) and save it in the same folder as your FuzzyJavaLexer.g file.

Execute the following command:

java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g

which will create a FuzzyJavaLexer.java source class.

Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below in it:

import org.antlr.runtime.*;

public class FuzzyJavaLexerTest {
    public static void main(String[] args) throws Exception {
        String source = 
            "class Test {                                 \n"+
            "  String s = \" ... \\\" // no comment \";   \n"+
            "  /*                                         \n"+
            "   * also no comment: // foo                 \n"+
            "   */                                        \n"+
            "  char quote = '\"';                         \n"+
            "  // yes, a comment, finally!!!              \n"+
            "  int i = 0; // another comment              \n"+
            "}                                            \n";
        System.out.println("===== source =====");
        System.out.println(source);
        System.out.println("==================");
        ANTLRStringStream in = new ANTLRStringStream(source);
        FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        for(Object obj : tokens.getTokens()) {
            Token token = (Token)obj;
            if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
                System.out.println("Found a SingleLineComment on line "+token.getLine()+
                        ", starting at column "+token.getCharPositionInLine()+
                        ", text: "+token.getText());
            }
        }
    }
}

Next, compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:

javac -cp .:antlr-3.2.jar *.java

and finally execute the FuzzyJavaLexerTest.class file:

// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest

or:

// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest

after which you'll see the following being printed to your console:

===== source =====
class Test {                                 
  String s = " ... \" // no comment ";   
  /*                                         
   * also no comment: // foo                 
   */                                        
  char quote = '"';                         
  // yes, a comment, finally!!!              
  int i = 0; // another comment              
}                                            

==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!              
Found a SingleLineComment on line 8, starting at column 13, text: // another comment

Pretty easy, eh? :)

拥有 2024-08-28 18:29:21

Use a parser and decide character by character whether you are inside a quoted string.

Kickoff example:

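// `string` is the full input text; `quoted` tracks whether we are inside "...".
StringBuilder builder = new StringBuilder();
boolean quoted = false;

for (String line : string.split("\\n")) {
    for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);
        if (c == '"') {
            // Note: an escaped quote (\") or a '"' char literal also toggles this flag.
            quoted = !quoted;
        }
        if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
            break; // "//" outside a string: drop the rest of the line
        } else {
            builder.append(c);
        }
    }
    builder.append("\n");
}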

String parsed = builder.toString();
System.out.println(parsed);
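If the kickoff example needs to survive escaped quotes, char literals and block comments, the same character-by-character idea can be extended into a small state machine. The following is only a sketch under those assumptions; the class name `LineCommentStripper`, the `State` enum and the `strip` method are hypothetical names, and Unicode escapes such as \u0022 are not handled.

```java
// Hypothetical helper: removes // comments while leaving string literals,
// char literals and /* ... */ comments untouched.
final class LineCommentStripper {

    private enum State { CODE, STRING, CHAR, BLOCK_COMMENT }

    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            char next = i + 1 < source.length() ? source.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        // Skip the comment body; the line terminator is kept.
                        while (i < source.length()
                                && source.charAt(i) != '\n' && source.charAt(i) != '\r') {
                            i++;
                        }
                        continue;
                    }
                    if (c == '/' && next == '*') {
                        out.append("/*");
                        i += 2;
                        state = State.BLOCK_COMMENT;
                        continue;
                    }
                    if (c == '"') {
                        state = State.STRING;
                    } else if (c == '\'') {
                        state = State.CHAR;
                    }
                    break;
                case STRING:
                case CHAR:
                    if (c == '\\' && i + 1 < source.length()) {
                        // Copy an escape sequence such as \" verbatim so it cannot end the literal.
                        out.append(c).append(next);
                        i += 2;
                        continue;
                    }
                    if ((state == State.STRING && c == '"') || (state == State.CHAR && c == '\'')) {
                        state = State.CODE;
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        out.append("*/");
                        i += 2;
                        state = State.CODE;
                        continue;
                    }
                    break;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n";
        System.out.print(strip(input));
    }
}
```

Unlike the line-splitting loop, this version walks the whole text once, so a `//` inside a multi-line block comment is left alone as well.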

药祭#氼 2024-08-28 18:29:21

(This is in answer to the question @finnw asked in the comment under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)

Here's my test code:

String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";

String test = 
    "class Test {                                 \n"+
    "  String s = \" ... \\\" // no comment \";   \n"+
    "  /*                                         \n"+
    "   * also no comment: // but no harm         \n"+
    "   */                                        \n"+
    "  /* no comment: // much harm  */            \n"+
    "  char quote = '\"';  // comment             \n"+
    "  // another comment                         \n"+
    "  int i = 0; // and another                  \n"+
    "}                                            \n"
    .replaceAll(" +$", "");
System.out.printf("%n%s%n", test);

System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));

r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.

r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"); and you only used two of them to match the backslash in the second part.

r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?
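The claims about r2 can be checked on single lines in isolation. This is a small sketch, with `R2Demo` and `strip` as made-up names; the regex is r2 exactly as given in the test code.

```java
// Hypothetical demo class wrapping the r2 regex from the test code above.
final class R2Demo {

    static final String R2 =
        "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";

    static String strip(String s) {
        return s.replaceAll(R2, "$1");
    }

    public static void main(String[] args) {
        // A trailing comment after plain code is removed.
        System.out.println("[" + strip("int i = 0; // and another") + "]");
        // A "//" inside a string literal (with an escaped quote) is left alone.
        System.out.println("[" + strip("String s = \" ... \\\" // no comment \";") + "]");
        // The double quote inside the char literal looks like an unterminated
        // string to r2, so the real comment after it is NOT removed.
        System.out.println("[" + strip("char quote = '\"';  // comment") + "]");
    }
}
```

The third case is the char-literal problem described above: the regex never gets past the lone `"`, so the line comes back unchanged.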

白首有我共你 2024-08-28 18:29:21

The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip Java comments before processing the file:

# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file.  Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space.
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================

sub strip_java_comments
{
      s!(  (?: \" [^\"\\]*   (?:  \\.  [^\"\\]* )*  \" )
         | (?: \' [^\'\\]*   (?:  \\.  [^\'\\]* )*  \' )
         | (?: \/\/  [^\n] *)
         | (?: \/\*  .*? \*\/)
       )
       !
         my $x = $1;
         my $first = substr($x, 0, 1);
         if ($first eq '/')
         {
             "\n" x ($x =~ tr/\n//);
         }
         else
         {
             $x;
         }
       !esxg;
}

This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by Unicode escapes (\u0022 etc.), but you can easily deal with those first if you want to.

As it's Perl, not Java, the replacement code will have to change. I'll have a quick crack at producing equivalent Java. Stand by...

EDIT: I've just whipped this up. Will probably need work:

// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment within a quoted string
// or a (partial or full) quoted-string within a comment.
// (I may not have translated the back-slashes accurately.  You'll figure it out)

Pattern p = Pattern.compile(
       "(  (?: \" [^\"\\\\]*   (?:  \\\\.  [^\"\\\\]* )*  \" )" +  //    " ... "
       "  | (?: ' [^'\\\\]*    (?:  \\\\.  [^'\\\\]*  )*  '  )" +  // or ' ... '
       "  | (?: //  [^\\n] *    )" +                               // or // ...
       "  | (?: /\\*  .*? \\* / )" +                               // or /* ... */
       ")",
       Pattern.DOTALL  | Pattern.COMMENTS
);

Matcher m = p.matcher(entireInputFileAsAString);

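// Note: appendReplacement/appendTail accept a StringBuilder only on Java 9+;
// on older versions use a StringBuffer here instead.
StringBuilder output = new StringBuilder();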

while (m.find())
{
    if (m.group(1).startsWith("/"))
    {
        // This is a comment. Replace it with a space...
        m.appendReplacement(output, " ");

        // ... or replace it with an equivalent number of newlines
        // (exercise for reader)
    }
    else
    {
        // We matched a quoted string.  Put it back
        m.appendReplacement(output, "$1");
    }
}

m.appendTail(output);
return output.toString();
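For anyone who wants to run this directly, here is one way the fragment could be wrapped into a complete class, including the "equivalent number of newlines" exercise. Treat it as a sketch rather than a finished tool: the names `RegexCommentStripper`, `TOKEN` and `strip` are invented, a `StringBuffer` is used so it also compiles before Java 9, and `Matcher.quoteReplacement` is used instead of `"$1"` to put literals back.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern above.
final class RegexCommentStripper {

    private static final Pattern TOKEN = Pattern.compile(
          "(  \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" "   //    " ... "
        + " | '  [^'\\\\]*  (?: \\\\. [^'\\\\]*  )* '  "   // or ' ... '
        + " | // [^\\n]* "                                  // or // ...
        + " | /\\* .*? \\*/ "                               // or /* ... */
        + ")",
        Pattern.DOTALL | Pattern.COMMENTS);

    static String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group(1);
            String replacement;
            if (token.startsWith("/")) {
                // A comment: keep one newline per newline it contained so line
                // numbers stay stable, or a single space if it had none.
                StringBuilder blank = new StringBuilder();
                for (int k = 0; k < token.length(); k++) {
                    if (token.charAt(k) == '\n') {
                        blank.append('\n');
                    }
                }
                replacement = blank.length() == 0 ? " " : blank.toString();
            } else {
                // A string or char literal: put it back unchanged.
                replacement = token;
            }
            // quoteReplacement keeps any '\' or '$' in the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n";
        System.out.print(strip(input));
    }
}
```

Because literals and comments are matched by the same left-to-right scan, whichever construct starts first wins, which is what keeps a `//` inside a string from being treated as a comment.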
