Automatically parsing PHP to separate PHP code from HTML
I'm working on a large PHP code base; I'd like to separate the PHP code from the HTML and JavaScript. (I need to do several automatic search-and-replaces on the PHP code, and different ones on the HTML, and different on the JS). Is there a good parser engine that could separate out the PHP for me? I could do this using regular expressions, but they're not perfect. I could build something in ANTLR, perhaps, but a good already existing solution would be best.
I should make clear: I don't want or need a full PHP parser. Just need to know if a given token is:
- PHP code
- PHP single quote string
- PHP double quote string
- PHP Comment
- Not PHP, but rather HTML/JavaScript
Comments (3)
How about the tokenizer built right into PHP itself?
You ask in the comments whether you can regenerate the code from the tokenized output. Yes, you can: all whitespace is preserved as T_WHITESPACE tokens. Here's how you might turn the tokenized output back into code:
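A minimal sketch of that round trip (not the answerer's original snippet). It assumes only the documented shape of token_get_all() output: each token is either a one-character string or an array of [id, text, line]. The function name tokens_to_source and the sample input are made up for illustration.

```php
<?php
// Rebuild source text from token_get_all() output.
// A token is either a plain string (single-character punctuation such as ";")
// or an array of [token id, token text, line number].
function tokens_to_source(array $tokens): string
{
    $out = '';
    foreach ($tokens as $token) {
        $out .= is_array($token) ? $token[1] : $token;
    }
    return $out;
}

// Hypothetical sample input mixing HTML and PHP.
$source = "<h1>Title</h1>\n<?php echo 'abc'; /* note */ ?>\n<p>done</p>\n";
$rebuilt = tokens_to_source(token_get_all($source));

// The tokenizer keeps whitespace, comments and inline HTML, so the rebuilt
// text can be compared directly with the input.
var_dump($rebuilt === $source);
```

To apply a search-and-replace, you would change $token[1] for the token types you care about inside the loop and leave everything else untouched.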
To separate the PHP from the rest, PHP's built-in tokenizer is your best choice: see token_get_all().
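As a rough sketch of how the tokenizer output could be mapped onto the five categories from the question: the function name classify_token and the category labels are made up for illustration. One simplification is labeled in the code: fragments of interpolated double-quoted strings arrive as T_ENCAPSED_AND_WHITESPACE, which heredocs also produce, so a stricter version would need to track the opening delimiter.

```php
<?php
// Map one token from token_get_all() onto the categories in the question.
// $token is either a one-character string or an array [id, text, line].
function classify_token($token): string
{
    if (!is_array($token)) {
        return 'php_code'; // punctuation such as ; ( ) .
    }
    switch ($token[0]) {
        case T_INLINE_HTML:
            return 'not_php'; // everything outside <?php ... ?>
        case T_COMMENT:
        case T_DOC_COMMENT:
            return 'php_comment';
        case T_CONSTANT_ENCAPSED_STRING:
            // A string literal without interpolation; skip an optional b prefix.
            $text = ltrim($token[1], 'bB');
            return $text[0] === "'" ? 'php_single_quoted' : 'php_double_quoted';
        case T_ENCAPSED_AND_WHITESPACE:
            // Simplifying assumption: treat every interpolated-string fragment
            // as double-quoted, although heredocs yield this token as well.
            return 'php_double_quoted';
        default:
            return 'php_code';
    }
}

// Hypothetical sample input.
$source = "<p>Hi</p><?php echo 'a' . \"b\"; // c\n?>";
foreach (token_get_all($source) as $token) {
    $text = is_array($token) ? $token[1] : $token;
    printf("%-18s %s\n", classify_token($token), json_encode($text));
}
```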
For the rest, you might be best off with a DOM parser. Isolating the <script> parts (and external scripts, and even onXXXX events) is easy that way. It might be tough to rebuild the identical document from a parsed DOM tree, though - I guess it depends on what you need to do with the results and how clean the original HTML is. A regular expression (yuck!) could work better for that part.
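For the DOM route, here is a small sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). It assumes the extension is enabled and that the input is the HTML portion with the PHP already separated out. The function name extract_javascript and the result keys are hypothetical.

```php
<?php
// Collect inline scripts, external script URLs and on* event handlers.
function extract_javascript(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = ['inline' => [], 'external' => [], 'handlers' => []];
    foreach ($doc->getElementsByTagName('script') as $script) {
        if ($script->hasAttribute('src')) {
            $found['external'][] = $script->getAttribute('src');
        } else {
            $found['inline'][] = $script->textContent;
        }
    }

    // Any attribute whose name starts with "on" (onclick, onload, ...).
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
        $found['handlers'][] = $attr->nodeName . '=' . $attr->nodeValue;
    }
    return $found;
}

// Hypothetical sample document.
$html = '<html><body><script src="a.js"></script>'
      . '<script>alert(1)</script><a onclick="go()">x</a></body></html>';
print_r(extract_javascript($html));
```

This only reads the JavaScript out. Writing a modified document back with saveHTML() will generally normalize the markup, which is the rebuilding problem mentioned above.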
If all you want to do is to inspect the tokens, then the PHP tokenizer, as others have suggested, might be a good choice.
If what you want to do is to automatically change the source code in a reliable way, I'm not sure that will help you. How will you regenerate the modified source text?
Another way to do this is to use a program transformation engine. Such an engine can parse the source text to abstract syntax trees, capturing the structure of the program (as well as the effective content of all the tokens), and allow searching and transforming of those ASTs using reliable pattern matches/transformations. To do this well, you need an engine that parses PHP reliably and can reproduce compilable source text from the changed AST.
Our DMS Software Reengineering Toolkit is such a program transformation system, and it has a robust PHP Front End that can process PHP5 accurately in terms of parsing, transforming and prettyprinting the result back to text. (Getting the PHP parser right is hard because the language is poorly documented.) Because the front end can pick up the HTML and the PHP code accurately, you don't need to separate out the text; they will be parked in clearly distinguishable places in unique tree nodes.
To change all echoed strings from lowercase to uppercase, you'd use DMS to parse the PHP, and then apply the following transformation rule:
This rule is written in DMS's Rule Specification Language (RSL), which is clearly not PHP. The stuff inside quote marks is PHP code; those are meta-quotes wrapped around the text of the programming language being manipulated. The \ character is a meta-escape: \s indicates a metavariable that must match a string literal, \uppercase is the name of a DMS function external to the RSL language, and the ( ) are meta-parentheses around the meta-function call to uppercase, applied to the matched string \s. Because the rule operates on the ASTs, it cannot be confused; it won't change the text of /* echo 'def' */ because that isn't a statement.
You likely need several rules to handle the variety of syntax combinations: STRING in this case refers to just singly-quoted literal strings; doubly-quoted strings aren't monolithic entities but are composed of a series of QUOTED_STRING_FRAGMENTS that correspond to the text in a doubly quoted string between the PHP expressions inside that doubly-quoted string.
At the end of the transformation process, the changed AST is emitted complete with the original indentation and comments except where the transformations have been applied.
There's also a fully language-accurate JavaScript parser for DMS, which you'd need if you wanted to process the content of SCRIPT tags accurately.
If you want to make reliable changes to source code, this IMHO is the only good way to do it. You can try string hacking and regular expressions, but parsing PHP requires a context-free parser and REs don't do that, so any result you get won't be trustworthy.