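Remove all comments (single-line/multi-line) and blank lines from a source file
How can I remove all comments and blank lines from a C# source file? Keep in mind that comment markers can appear inside other comments. Some examples: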
string text = @"//not a comment"; // a comment
/* multiline
comment */ string newText = "/*not a comment*/"; // a comment
/* multiline // not a comment
/* comment */ string anotherText = "/* not a comment */ // some text here\"// not a comment"; // a comment
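The source can be more complex than the three examples above. Can someone suggest a regex pattern or another way to solve this? I have searched the internet extensively but could not find anything useful.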
Comments (6)
You could use the function in this answer:
Then remove the empty lines.
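The function this answer links to is not reproduced here, but the second step (removing empty lines) can be sketched on its own. This is a minimal sketch, assuming "empty" also covers whitespace-only lines; the class and method names BlankLineRemover and RemoveBlankLines are hypothetical, not taken from the linked answer.

```csharp
using System;
using System.Linq;

public static class BlankLineRemover
{
    // Hypothetical helper: drops lines that are empty or whitespace-only.
    // Assumption: the result is joined with "\n", so line endings are normalized.
    public static string RemoveBlankLines(string source)
    {
        // Split on the common line-ending styles. Any empty piece produced
        // by the split is filtered out below, so blank lines disappear.
        string[] lines = source.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
        return string.Join("\n", lines.Where(line => !string.IsNullOrWhiteSpace(line)));
    }
}
```

Run this after the comments have been stripped, so that lines which held only a comment are removed as well.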
Unfortunately, this is really difficult to do reliably with regex without running into edge cases. I haven't investigated very far, but you might be able to use the Visual Studio Language Services to parse comments.
If you want to identify comments with regexes, you really need to use the regex as a tokenizer. That is, it identifies and extracts the first thing in the string, whether that thing is a string literal, a comment, or a block of stuff that is neither string literal nor comment. Then you grab the remainder of the string and pull the next token off the beginning.
This gets you around the problems with context. If you're just trying to look for things in the middle of the string, there's no good way to identify whether a particular "comment" is inside a string literal or not. In fact, it's hard to identify where the string literals are in the first place, because of things like \". But if you always take the first thing in the string, it's easy to say "oh, the string starts with ", so everything up to the next unescaped " is more string." Context takes care of itself.
So you would want three regexes: one that matches a comment at the start of the string (either a // or a /* comment); one that matches a string literal at the start of the string, covering both " and @" strings, each of which has its own edge cases; and one that matches anything else.
Writing the actual regex patterns is left as an exercise for the reader, since it would take hours to write and test it all and I'm not willing to do that for free. (grin) But it's certainly doable, if you have a good understanding of regexes (or have a place like StackOverflow to ask specific questions when you get stuck) and are willing to write a bunch of automated tests for your code. Watch out on that last ("anything else") case, though: you want to stop just before an @ if it's followed by a ", but not if it's an @ used to escape a keyword so it can serve as an identifier.
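The tokenizer idea from the answer above (always take the first token off the front of the remaining input) can be sketched roughly as follows. This is a sketch under stated assumptions, not a complete C# lexer: the name CommentStripper and all patterns are illustrative, character literals are treated like strings so that '"' does not open a string, and interpolated strings with nested quotes and raw string literals are not handled. C# block comments do not nest, so a block comment ends at the first */.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // Smaller patterns, written as verbatim strings ("" is one quote character).
    private const string VerbatimString = @"@""(?:""""|[^""])*""?";
    private const string RegularString = @"""(?:\\.|[^""\\\r\n])*""?";
    private const string CharLiteral = @"'(?:\\.|[^'\\\r\n])*'?";

    // \G anchors each pattern at the position passed to Match.
    private static readonly Regex Comment = new Regex(
        @"\G(?://[^\r\n]*|/\*.*?(?:\*/|\z))", RegexOptions.Singleline);
    private static readonly Regex Literal = new Regex(
        @"\G(?:" + VerbatimString + "|" + RegularString + "|" + CharLiteral + ")",
        RegexOptions.Singleline);
    // "Anything else": a run of characters that cannot start a comment or a
    // literal, or a single character as a fallback (a lone '/' or an '@'
    // that prefixes an identifier).
    private static readonly Regex Other = new Regex(
        @"\G(?:[^/""'@]+|.)", RegexOptions.Singleline);

    public static string StripComments(string source)
    {
        var output = new StringBuilder();
        int pos = 0;
        while (pos < source.Length)
        {
            Match m = Comment.Match(source, pos);
            if (m.Success)
            {
                // A block comment may separate two tokens, so leave one space.
                // Line comments are dropped; their newline is not in the match.
                if (m.Value[1] == '*') output.Append(' ');
            }
            else
            {
                m = Literal.Match(source, pos);
                if (!m.Success) m = Other.Match(source, pos); // always matches >= 1 char
                output.Append(m.Value);
            }
            pos += m.Length;
        }
        return output.ToString();
    }
}
```

Because the "anything else" pattern stops before every @, the next iteration decides whether that @ opens a verbatim string or merely prefixes an identifier, which is one way to handle the caveat described above. The output can still contain trailing spaces and whitespace-only lines where comments used to be.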
Also see my project for C# code minification: CSharp-Minifier
Besides removing comments, whitespace, and line breaks from code, it can currently also shorten local variable names and perform other minifications.
First, you'll definitely want to pass RegexOptions.Singleline when constructing your Regex instance, so that . also matches newline characters. Right now, you are processing single lines of code.
To complement the RegexOptions.Singleline option, make sure you use the start and end string anchors (^ and $ respectively), because for the specific cases you have, you want the regular expression to apply to the entire string.
I'd also recommend breaking up the conditions and using alternation to handle the smaller cases, constructing a larger regular expression from smaller, easier-to-manage expressions.
Finally, I know this is homework, but parsing a programming language with regular expressions is an exercise in futility (it's not a practical application). Regular expressions are better suited to more highly structured data. If you want to do things like this in the future, use a parser built for the language (in this case, I'd highly recommend Roslyn).
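To illustrate the advice about the single-line option, anchors, and alternation, here is a small sketch. The option is spelled RegexOptions.Singleline in .NET. The class name SinglelineDemo and the method IsOnlyComment are hypothetical, and the check deliberately looks at an isolated piece of text only; it does not solve the strings-versus-comments problem.

```csharp
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Two small patterns, combined by alternation into a larger one.
    private const string LineComment = @"//[^\r\n]*";
    // (?!\*/). consumes one character at a time but never steps over "*/",
    // so the pattern cannot run past the end of the first block comment.
    private const string BlockComment = @"/\*(?:(?!\*/).)*\*/";
    private const string AnyComment = "(?:" + LineComment + "|" + BlockComment + ")";

    // True when the whole text is exactly one comment.
    // ^ and $ force the pattern to cover the entire input string.
    public static bool IsOnlyComment(string text, RegexOptions options)
    {
        return Regex.IsMatch(text, "^" + AnyComment + "$", options);
    }
}
```

With RegexOptions.None the dot stops at a newline, so IsOnlyComment("/* a\nb */", RegexOptions.None) is false; with RegexOptions.Singleline the same call is true, because the dot then also matches the newline inside the comment.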
Use my project to remove most comments. https://github.com/SynAppsDevelopment/CommentRemover
It removes all full-line, end-of-line, and XML Doc code comments, with some limitations for complex comments that are explained in the readme and source. It is a C# solution with a WinForms interface.