(PHP) Regex to remove comments but ignore occurrences inside strings

Posted on 2024-08-25 13:29:14

I am writing a comment stripper and trying to accommodate all needs here. I have the stack of code below, which removes pretty much all comments, but it actually goes too far. A lot of time was spent trying, testing, and researching the regex patterns to match, but I don't claim that each of them is the best possible.

My problem is that I also have situations where 'PHP comments' (which aren't really comments) appear in standard code, or even inside PHP strings, and I don't actually want those removed.

Example:

<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>

What ends up happening is that it strips everything out religiously, which is fine, but it leaves certain problems:

<?php  $Var = "Blah blah  ?>

Also:

will also cause problems, as the comment removes the rest of the line, including the ending ?>

See the problem? So this is what I need...

Comment characters within '' or "" need to be ignored
PHP comments that use double slashes and share a line with code should perhaps remove only the comment itself, or else remove the entire PHP code block.

Here are the patterns I use at the moment; feel free to tell me if there is any improvement I can make to my existing patterns. :)

$CompressedData = $OriginalData;
$CompressedData = preg_replace('!/\*.*?\*/!s', '', $CompressedData);  // removes /* comments */
$CompressedData = preg_replace('!//.*?\n!', '', $CompressedData); // removes //comments
$CompressedData = preg_replace('!#.*?\n!', '', $CompressedData); // removes # comments
$CompressedData = preg_replace('/<!--(.*?)-->/', '', $CompressedData); // removes HTML comments

Any help that you can give me would be greatly appreciated! :)

傲性难收 2024-09-01 13:29:15

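One way to do this with regex is to use a single compound expression together with preg_replace_callback.

I was going to post a rough example, but the best place to look is the source code of the PHP port of Dean Edwards' JS packer script, which should give you the general idea.

http://joliclic.free.fr/php/javascript-packer/en/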
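
The compound-expression idea can be sketched roughly as follows. This is not the packer's code, only a minimal illustration under stated assumptions: the function name stripCommentsKeepStrings and the pattern are hypothetical, and the sketch ignores heredocs, nowdocs, and any HTML outside the PHP tags. A single pattern matches either a quoted string or a comment, and the callback returns strings untouched while dropping comments.

```php
<?php
// Hypothetical helper: one pass, strings are matched first so that
// comment markers inside them are never seen as comments.
function stripCommentsKeepStrings(string $source): string
{
    // x flag: whitespace in the pattern is ignored; s flag: dot matches newlines.
    $pattern = <<<'REGEX'
~
  (?<string>  "(?:[^"\\]|\\.)*" | '(?:[^'\\]|\\.)*' )
| (?<comment> /\*.*?\*/ | (?://|\#) .*? (?= \?> | \R | \z ) )
~xs
REGEX;

    $result = preg_replace_callback(
        $pattern,
        static function (array $m): string {
            // Drop the match when the comment branch matched, keep it otherwise.
            return ($m['comment'] ?? '') !== '' ? '' : $m[0];
        },
        $source
    );

    // preg_replace_callback returns null on a regex engine error.
    return $result ?? $source;
}

$example = '<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>';
echo stripCommentsKeepStrings($example), "\n";
```

On the question's example this keeps the string intact and leaves the closing tag in place, because a one-line comment is only consumed up to the closing tag or the end of the line.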

尬尬 2024-09-01 13:29:15

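Try this:

// Method excerpt: it is meant to live inside a class.
private function removeComments( $content ){
    // Remove /* ... */ block comments (the s flag lets . span newlines).
    $content = preg_replace( "!/\*.*?\*/!s" , '', $content );
    // Collapse blank lines into a single newline.
    $content = preg_replace( "/\n\s*\n/" , "\n", $content );
    // Remove lines that start with // after optional whitespace.
    $content = preg_replace( '#^\s*//.+$#m' , "", $content );
    // Remove trailing // comments that are preceded by whitespace.
    $content = preg_replace( '![\s\t]//.*?\n!' , "\n", $content );
    // Remove HTML comments that sit on a single line.
    $content = preg_replace( '/<\!--.*-->/' , "\n", $content );
    return $content;
}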

小嗷兮 2024-09-01 13:29:14

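If you want to parse PHP, you can use token_get_all to get the tokens of the given PHP code. Then you only need to iterate over the tokens, remove the comment tokens, and put the rest back together.

You would still need a separate pass for HTML comments, preferably with a real parser (DOMDocument, for example, provides DOMDocument::loadHTML).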
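
A minimal sketch of the token approach might look like this. The function name stripPhpComments is hypothetical, and HTML comments are deliberately left alone, since the tokenizer only understands PHP. Comment tokens (T_COMMENT and T_DOC_COMMENT) are skipped and everything else is concatenated back in order.

```php
<?php
// Hypothetical helper: rebuild the source from its tokens, minus comments.
function stripPhpComments(string $source): string
{
    $output = '';
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ";" or "=" come back as plain strings.
        if (is_string($token)) {
            $output .= $token;
            continue;
        }

        [$id, $text] = $token;
        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            // Older PHP versions include the trailing newline in one-line
            // comment tokens; keep it so following code stays on its own line.
            if (substr($text, -1) === "\n") {
                $output .= "\n";
            }
            continue;
        }

        $output .= $text;
    }

    return $output;
}

$example = '<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>';
echo stripPhpComments($example), "\n";
```

Because the tokenizer already knows where strings start and end, the comment marker inside the string in the question's example is never treated as a comment, and the closing tag survives.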

难忘№最初的完美 2024-09-01 13:29:14

You should first think carefully about whether you actually want to do this. Though what you're doing may seem simple, in the worst case it becomes an extremely complex problem (to solve with just a few regular expressions). Let me illustrate just a few of the problems you would face when trying to strip both HTML and PHP comments from a file.

You can't just strip HTML comments outright, because you may have PHP inside the HTML comments, like:

<!-- HTML comment <?php echo 'Actual PHP'; ?> -->

You can't simply deal with the content between the <?php and ?> tags separately either, since the closing tag ?> can appear inside strings or even comments, like:

<?php /* ?> This is still a PHP comment <?php */ ?>

Let's not forget that ?> actually ends the PHP block even when it is preceded by a one-line comment. For example:

<?php // ?> This is not a PHP comment <?php ?>
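
To see how PHP itself reads these two cases, one option is to list the token names that token_get_all produces. The helper name tokenNames is hypothetical and the snippet is only a sketch for inspection. In the block-comment case no inline HTML token should appear, while in the one-line-comment case the text after the closing tag should show up as T_INLINE_HTML.

```php
<?php
// Hypothetical helper: map a source string to the names of its tokens.
function tokenNames(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        // Array tokens carry a numeric id; plain strings are single characters.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

$insideBlockComment = '<?php /* ?> This is still a PHP comment <?php */ ?>';
$afterLineComment   = '<?php // ?> This is not a PHP comment <?php ?>';

print_r(tokenNames($insideBlockComment));
print_r(tokenNames($afterLineComment));
```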

Of course, as you already illustrated, there will be plenty of problems with comment indicators inside strings. Parsing out strings in order to ignore them isn't that simple either, since you have to remember that quotes can be escaped. For example:

<?php
$foo = ' /* // None of these start a comment ';
$bar = ' \' // Remember escaped quotes ';
$orz = " ' \" \' /* // Still not a comment ";
?>
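
As a rough illustration of matching quoted strings while respecting escaped quotes, the sketch below uses the common pattern "either a character that is not a quote or backslash, or a backslash followed by any character". The helper name findStringLiterals is hypothetical, and heredoc and nowdoc strings are not covered.

```php
<?php
// Hypothetical helper: return every single- or double-quoted literal found.
function findStringLiterals(string $source): array
{
    // A backslash always consumes the next character, so \' and \" never
    // terminate the literal.
    $pattern = <<<'REGEX'
~"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'~s
REGEX;

    preg_match_all($pattern, $source, $matches);
    return $matches[0];
}

$sample = <<<'SRC'
$foo = ' /* // None of these start a comment ';
$bar = ' \' // Remember escaped quotes ';
$orz = " ' \" \' /* // Still not a comment ";
SRC;

foreach (findStringLiterals($sample) as $literal) {
    echo $literal, "\n";
}
```

Each of the three assignments above should yield exactly one literal, running from its opening quote to the matching unescaped closing quote.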

Parsing order will also cause you headaches. You can't simply choose to parse either the one-line comments first or the multi-line comments first. They both have to be parsed at the same time (i.e. in the order they appear in the document). Otherwise you may end up with broken code. Let me illustrate:

<?php
/* // Multiline comment */
// /* Single Line comment
$omg = 'This is not in a comment */';
?>

If you parse multi-line comments first, the second /* will eat up part of the string, destroying the code. If you parse the single-line comments first, you will end up eating the first */, which will also destroy the code.
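
The effect of the two orderings can be reproduced with the two patterns from the question. This is a small sketch; the function names stripBlockThenLine and stripLineThenBlock are hypothetical. In both orders the assignment to $omg is lost.

```php
<?php
$source = <<<'SRC'
<?php
/* // Multiline comment */
// /* Single Line comment
$omg = 'This is not in a comment */';
?>
SRC;

// Block comments first, then one-line comments.
function stripBlockThenLine(string $s): string
{
    $s = (string) preg_replace('!/\*.*?\*/!s', '', $s);
    return (string) preg_replace('!//.*?\n!', '', $s);
}

// One-line comments first, then block comments.
function stripLineThenBlock(string $s): string
{
    $s = (string) preg_replace('!//.*?\n!', '', $s);
    return (string) preg_replace('!/\*.*?\*/!s', '', $s);
}

// Both checks report false: the $omg assignment does not survive either order.
var_dump(strpos(stripBlockThenLine($source), '$omg') !== false);
var_dump(strpos(stripLineThenBlock($source), '$omg') !== false);
```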

As you can see, there are many complex scenarios you'd have to account for if you intend to solve your problem with regular expressions. The only correct solution is to use some sort of PHP parser, like token_get_all(), to tokenize the entire source code, strip the comment tokens, and rebuild the file, which, I'm afraid, isn't entirely simple either. It also won't help with HTML comments, since the HTML is left untouched. You can't use XML parsers to get the HTML comments either, because HTML mixed with PHP is rarely well formed.

In short, the idea behind what you're doing is simple, but the actual implementation is much harder than it seems. Thus, I would recommend avoiding this unless you have a very good reason to do it.
