How to match a comment unless it's inside a quoted string?
So I have some strings:
//Blah blah blach
// sdfkjlasdf
"Another //thing"
And I'm using a Java regex to replace all the lines that have double slashes, like so:
theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");
It works for the most part, but the problem is that it removes every occurrence, and I need a way to keep it from removing the quoted one. How would I go about doing that?
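A small reproduction of the problem, assuming the three sample lines are joined with newline characters (the class and method names are hypothetical and only serve the illustration):

```java
import java.util.regex.Pattern;

class NaiveRegexDemo {

    // The regex from the question, wrapped in a helper for illustration.
    static String naiveStrip(String s) {
        return Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumption: the sample lines are separated by '\n'.
        String theString = "//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n";
        // The two real comments are removed, but so is everything from the
        // "//" inside the quoted string up to the end of that line.
        System.out.println(naiveStrip(theString));
    }
}
```

The pattern has no notion of being inside a string literal, so the `//` in the quoted text is treated like any other comment start.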
Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only the parts you're interested in, you could use a third-party tool like ANTLR.
ANTLR lets you define only the tokens you are interested in (plus, of course, the tokens that can mess up your token stream, such as multi-line comments and string and char literals). So you only need to define a lexer (another word for tokenizer) that correctly handles those tokens.
This is called a grammar. In ANTLR, such a grammar could look like this:
Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 and save it in the same folder as your FuzzyJavaLexer.g file.
Execute the following command:
which will create a FuzzyJavaLexer.java source class.
Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below into it:
Next, compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:
and finally execute the FuzzyJavaLexerTest.class file:
or:
after which you'll see the following being printed to your console:
Pretty easy, eh? :)
Use a parser and decide char by char whether you are inside a string or a comment.
Kickoff example:
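A minimal sketch of what such a char-by-char scanner could look like. This is not the answerer's original code: the class name, the state names, and the decision to keep the line terminator after a removed `//` comment are all assumptions. It does not handle Unicode escapes such as \u0022 or Java text blocks.

```java
// Hypothetical helper: removes // and /* ... */ comments from Java-like
// source while copying string and char literals through untouched.
class CommentStripper {

    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            // Look ahead one character; '\0' stands for "end of input".
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i++; // skip the second '/'
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        i++; // skip the '*'
                    } else {
                        if (c == '"') state = State.STRING;
                        else if (c == '\'') state = State.CHAR;
                        out.append(c);
                    }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        out.append(next); // copy the escaped character verbatim
                        i++;
                    } else if ((state == State.STRING && c == '"')
                            || (state == State.CHAR && c == '\'')) {
                        state = State.CODE; // closing quote
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n' || c == '\r') {
                        state = State.CODE;
                        out.append(c); // keep the line terminator
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        state = State.CODE;
                        i++; // skip the closing '/'
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n";
        System.out.print(strip(input));
    }
}
```

Unlike the regex in the question, this sketch leaves the newline after each removed comment in place, so line numbers in the remaining text are preserved.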
(This is in answer to the question @finnw asked in the comment under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)
Here's my test code:
r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.
r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"); and you only used two of them to match the backslash in the second part.
r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?
The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip Java comments before processing the file:
This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by Unicode escapes (\u0022 etc.), but you can easily deal with those first if you want to.
As it's Perl, not Java, the replacement code will have to change. I'll have a quick crack at producing equivalent Java. Stand by...
EDIT: I've just whipped this up. Will probably need work:
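One common way to strip comments with a single regex is to match literals and comments in one alternation, put the literals in a capturing group, and write back only what that group captured. The sketch below shows that technique in Java. It is not the author's listing: the pattern, the class name, and the method name are assumptions, and like the description above it ignores Unicode escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "match literals first" technique.
class RegexCommentStripper {

    // Group 1 captures a string or char literal (kept as-is).
    // The other two alternatives match comments (replaced with nothing).
    private static final Pattern LITERAL_OR_COMMENT = Pattern.compile(
            "(\"(?:\\\\.|[^\"\\\\\\r\\n])*\"|'(?:\\\\.|[^'\\\\\\r\\n])*')"
            + "|//[^\\r\\n]*"          // single-line comment, newline not included
            + "|/\\*[\\s\\S]*?\\*/");  // multi-line comment, shortest match

    static String strip(String src) {
        Matcher m = LITERAL_OR_COMMENT.matcher(src);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A literal goes back unchanged; a comment becomes the empty string.
            String replacement = m.group(1) != null ? m.group(1) : "";
            // quoteReplacement stops '$' and '\' in literals from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n";
        System.out.print(strip(input));
    }
}
```

Because the matcher scans left to right and a literal is consumed as a whole, a `//` inside quotes is never seen as a comment start, and a quote inside a comment is swallowed with the comment. Very long literals may stress Java's backtracking regex engine, so treat this as a starting point rather than a finished tool.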
You can't tell with a regex whether you are inside a double-quoted string or not. In the end, a regex is just a state machine (sometimes extended a bit). I would use a parser as provided by BalusC or this one.
If you want to know why regexes are limited, read about formal grammars. The Wikipedia article is a good start.