Syntax highlighting / lexing algorithm
What is the general algorithm used by a syntax highlighter? I have implemented a simple approach using alternation in regex:
STRING_PATTERN|COMMENT_PATTERN|KEYWORD_PATTERNS
This is because whether something is a string or a comment depends on which one starts first:
// This is a "comment"
"This is a // string"
But it gets a bit more complicated with keywords. This approach is working in my current implementation, but I'm not convinced it's optimal.
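As a rough illustration of the alternation approach described above, here is a minimal Python sketch. The pattern names, the C-like string and comment syntax, and the helper `find_tokens` are assumptions made for this example, not the plugin's actual patterns.

```python
import re

# Hypothetical patterns for a C-like language; names are illustrative only.
TOKEN_RE = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\\n])*")   # double-quoted string with backslash escapes
  | (?P<comment>//[^\n]*)             # line comment up to the end of the line
''', re.VERBOSE)

def find_tokens(source):
    """Return (kind, text) pairs in the order they appear in source."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]
```

Because the regex engine scans left to right and resumes after each match, the quotes inside the comment and the slashes inside the string are consumed as part of the enclosing token and are never matched on their own.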
Another problem is the order you highlight in. If you highlight numbers before identifiers/keywords then you might accidentally highlight a number within a keyword...
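One way to sidestep the ordering problem, sketched here under assumptions, is to match whole words in a single pass and only afterwards decide whether a word is a keyword. The keyword set and the names `WORD_OR_NUMBER` and `classify` are hypothetical.

```python
import re

KEYWORDS = {"if", "else", "while", "return", "int32"}  # hypothetical keyword set

WORD_OR_NUMBER = re.compile(r'''
    (?P<word>[A-Za-z_][A-Za-z0-9_]*)   # identifier or keyword; digits allowed after the first character
  | (?P<number>\d+(?:\.\d+)?)          # integer or simple decimal
''', re.VERBOSE)

def classify(source):
    """Return (kind, text) pairs; words are split into keywords and identifiers."""
    tokens = []
    for m in WORD_OR_NUMBER.finditer(source):
        kind = m.lastgroup
        if kind == "word":
            # Decide keyword vs identifier by lookup, after the whole word is matched.
            kind = "keyword" if m.group() in KEYWORDS else "identifier"
        tokens.append((kind, m.group()))
    return tokens
```

Since a word is consumed in full before the scan continues, the digits in `int32` or `x1` are never offered to the number pattern.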
EDIT:
My plugin is now here: http://wordpress.org/extend/plugins/crayon-syntax-highlighter/
Comments (1)
You might struggle to do this with regex because it doesn't help your syntax highlighting understand the context, i.e. a regex will match something that appears anywhere, irrespective of whether it's part of a larger possible match.
You need to investigate parser generators such as Antlr which, given a valid, unambiguous grammar, are capable of giving you tokens that take these details into account. E.g. if a comment is defined as "//" up to an EOL, it will return a comment token, which takes precedence over any string characters or anything else inside it.
The standard approach for parsers like this is to read in the stream of characters (or tokens, more specifically) one at a time, so the highlighting depends not on the order of the rules you define but on the order in which they appear in the stream.
For example, a string could be two double-quote marks and everything in between (except for another double quote). A comment is two slashes and everything up to the end-of-line.
When parsing, if you find a double-quote then your program goes into an "I think it's a string" mode, and once it finds the matching end-quote it confirms a string token and returns it for highlighting. Similarly, if it finds two slashes then it searches until it finds an end-of-line (or end-of-file, in reality), then returns that as a token for highlighting.
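The two modes described above can be sketched as a small hand-written scanner. This is only an illustration: the function name `scan` is made up, strings follow the simple definition given here (no escape sequences), and treating an unterminated string as running to the end of the input is an assumption.

```python
def scan(source):
    """Return (kind, text) pairs for strings and // comments, in stream order."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        if source[i] == '"':
            # "I think it's a string" mode: look for the closing quote.
            end = source.find('"', i + 1)
            if end == -1:
                # Assumption: an unterminated string runs to the end of the input.
                end = n - 1
            tokens.append(("string", source[i:end + 1]))
            i = end + 1
        elif source.startswith("//", i):
            # Comment mode: run to the end of the line, or the end of the file.
            end = source.find("\n", i)
            if end == -1:
                end = n
            tokens.append(("comment", source[i:end]))
            i = end
        else:
            i += 1  # plain text, not highlighted in this sketch
    return tokens
```

For example, `scan('x = "a // b"; // c "d"')` yields one string token followed by one comment token, because whichever construct opens first swallows the other's delimiters.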
It gets more complicated when there are multiple possible matching rules, e.g. for single-line and multi-line comments. If you grab a single slash character, your program needs to read another character before it can reject some of those options, i.e. until it sees either a second slash or a *, it won't know what sort of token it's in.
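The one-character lookahead after a slash might look like the following sketch. The function name `scan_slash`, the token kind names, and the handling of a bare slash as an operator are assumptions for illustration.

```python
def scan_slash(source, i):
    """Classify the token starting at source[i], which must be '/'.

    Returns (kind, end), where end is the index just past the token.
    """
    assert source[i] == "/"
    # Peek at the next character before committing to a token type.
    nxt = source[i + 1] if i + 1 < len(source) else ""
    if nxt == "/":
        end = source.find("\n", i)
        return ("line_comment", len(source) if end == -1 else end)
    if nxt == "*":
        # Start searching after the opening "/*" so "/*/" is not treated as closed.
        close = source.find("*/", i + 2)
        # Assumption: an unterminated block comment runs to the end of the input.
        return ("block_comment", len(source) if close == -1 else close + 2)
    return ("operator", i + 1)
```

Only after seeing the second character can the scanner pick a state; until then all three possibilities remain open.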
Fundamentally it all comes down to state machines. You could try building your own, or you could get something like Antlr, feed it a grammar, and let it do all your work for you.