Is it possible to set priorities for rules to avoid the "longest-earliest" matching pattern?

Posted 2024-12-19 14:59:28 · 510 words · 5 views · 0 comments

Another simple question: is there any way to tell flex to prefer a rule that matches a shorter string over a rule that matches a longer one? I can't find any good documentation about that.

Here is why I need that: I am parsing a file in a pseudo-language that contains some keywords corresponding to control instructions. I'd like them to have absolute priority so that they are never parsed as part of an expression. I need this priority mechanism because I am not writing a full grammar for my project (that would be overkill in my case, since I only perform structural analysis on the parsed program and don't need the details), so I can't rely on fine grammar tuning to make sure those blocks aren't parsed as expressions.

Any help will be appreciated.

Here is an example of a file to be parsed:

If a > 0 Then read(b); Endif
c := "If I were...";
While d > 5 Do d := d + 1 Endwhile

I just want to collect information on the Ifs, Thens, Endifs, etc. The rest doesn't matter to me. That's why I'd like the rules for If, Then, and so on to be prioritized without having to write a grammar.

Comments (1)

桃扇骨 2024-12-26 14:59:28

From the Dragon Book 2nd edition, Section 3.5.3 "Conflict Resolution in Lex":

We have alluded to the two rules that Lex uses to decide on the proper lexeme
to select, when several prefixes of the input match one or more patterns:
    1. Always prefer a longer prefix to a shorter prefix.
    2. If the longest possible prefix matches two or more patterns, prefer the
       pattern listed first in the Lex program.

The rules above also apply to flex. Here is what the flex manual says (Chapter 7, "How the input is matched"):

When the generated scanner is run, it analyzes its input looking for strings 
which match any of its patterns. If it finds more than one match, it takes the 
one matching the most text (for trailing context rules, this includes the length 
of the trailing part, even though it will then be returned to the input). If it 
finds two or more matches of the same length, the rule listed first in the flex 
input file is chosen.
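
The two rules quoted above can be sketched in plain C++. This is only an illustration of the selection logic, not how flex is implemented (flex compiles all patterns into a single automaton); `Rule`, `Match`, `sample_rules`, and `longest_match` are hypothetical names introduced here, and `std::regex` stands in for flex patterns on the assumption that each pattern is simple enough for greedy matching to yield its longest match.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct Rule {
    std::string name;
    std::regex pattern;
};

struct Match {
    std::string rule;  // name of the winning rule, empty if nothing matched
    std::string text;  // the matched lexeme
};

// Two of the rules from this answer, in the same order: keyword first, identifier second.
std::vector<Rule> sample_rules() {
    return {
        {"IF", std::regex("If")},
        {"IDENTIFIER", std::regex("[a-zA-Z_][a-zA-Z0-9_]*")},
    };
}

// Rule 1: the longest match starting at pos wins.
// Rule 2: on a tie, the rule listed first wins (the strict '>' keeps the earlier rule).
Match longest_match(const std::vector<Rule>& rules, const std::string& input, std::size_t pos) {
    assert(pos <= input.size());
    Match best;
    for (const Rule& r : rules) {
        std::smatch m;
        bool ok = std::regex_search(input.cbegin() + static_cast<std::ptrdiff_t>(pos),
                                    input.cend(), m, r.pattern,
                                    std::regex_constants::match_continuous);
        if (ok && static_cast<std::size_t>(m.length(0)) > best.text.size()) {
            best = {r.name, m.str(0)};
        }
    }
    return best;
}
```

Calling `longest_match(sample_rules(), "If a > 0", 0)` selects the `IF` rule, because both rules match the two characters `If` and the keyword is listed first, while `longest_match(sample_rules(), "If_this_is_an_identifier", 0)` selects `IDENTIFIER`, because that rule matches more text.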

If I understood correctly, your lexer treats keywords such as Endif as identifiers, so they are later considered part of an expression. If that is your problem, simply put the keyword rules at the top of your specification, like this (assuming each uppercase name is a predefined enum constant corresponding to a token, with the keywords spelled as in your sample):

"If"                      { return IF;         }
"Then"                    { return THEN;       }
"Endif"                   { return ENDIF;      }
"While"                   { return WHILE;      }
"Do"                      { return DO;         }
"Endwhile"                { return ENDWHILE;   }
\"(\\.|[^\\"])*\"         { return STRING;     }
[a-zA-Z_][a-zA-Z0-9_]*    { return IDENTIFIER; }

The keywords will then always win over the identifier rule whenever both match the same text, thanks to rule 2.

EDIT:

Thank you for your comment, kol. I forgot to add the rule for strings, but I don't think my solution is wrong. For example, given an identifier called If_this_is_an_identifier, rule 1 applies, so the identifier rule takes effect (since it matches the longest string). I wrote a simple test case and saw no problem with my solution. Here is my lex.l file:

%{
  // C++ prologue: the actions below print with iostream; fopen/perror come from cstdio.
  #include <cstdio>
  #include <iostream>
  using namespace std;
%}

ID       [a-zA-Z_][a-zA-Z0-9_]*

%option noyywrap
%%

  /* Keyword rules come first so that they win ties against {ID} (rule 2). */
"If"                      { cout << "IF: " << yytext << endl;         }
"Then"                    { cout << "THEN: " << yytext << endl;       }
"Endif"                   { cout << "ENDIF: " << yytext << endl;      }
"While"                   { cout << "WHILE: " << yytext << endl;      }
"Do"                      { cout << "DO: " << yytext << endl;         }
"Endwhile"                { cout << "ENDWHILE: " << yytext << endl;   }
\"(\\.|[^\\"])*\"         { cout << "STRING: " << yytext << endl;     }
{ID}                      { cout << "IDENTIFIER: " << yytext << endl; }
.                         { cout << "Ignore token: " << yytext << endl; }

%%

int main(int argc, char* argv[]) {
  ++argv, --argc;  /* skip over program name */
  if ( argc > 0 ) {
    yyin = fopen( argv[0], "r" );
    if ( !yyin ) {
      perror( argv[0] );  /* report an unreadable input file */
      return 1;
    }
  } else {
    yyin = stdin;
  }

  yylex();
  return 0;
}

I tested my solution with the following test case:

If If_this_is_an_identifier > 0 Then read(b); Endif
    c := "If I were...";
While While_this_is_also_an_identifier > 5 Do d := d + 1 Endwhile

and it gives me the following output (output not relevant to the problem you mentioned is omitted):

IF: If
IDENTIFIER: If_this_is_an_identifier
......
STRING: "If I were..."
......
WHILE: While
IDENTIFIER: While_this_is_also_an_identifier
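
For the goal stated in the question (collecting only the control keywords and ignoring everything else), the same discipline can also be written by hand. The following is a minimal sketch, not the flex-generated scanner: `collect_keywords` is a hypothetical helper, and the keyword list assumes the spellings used in the question's sample (`Endif`, `Endwhile`). It consumes string literals and whole words first, which is why `"If I were..."` and `If_this_is_an_identifier` never produce a keyword.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the control keywords found in src, in order of appearance.
std::vector<std::string> collect_keywords(const std::string& src) {
    // Assumed keyword spellings, taken from the question's sample input.
    static const std::vector<std::string> keywords =
        {"If", "Then", "Endif", "While", "Do", "Endwhile"};
    std::vector<std::string> found;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (c == '"') {
            // String literal: consume through the closing quote, honoring backslash escapes.
            ++i;
            while (i < src.size() && src[i] != '"') {
                if (src[i] == '\\' && i + 1 < src.size()) ++i;
                ++i;
            }
            if (i < src.size()) ++i;  // skip the closing quote
        } else if (std::isalpha(c) || c == '_') {
            // Longest match: take the whole word before deciding what it is.
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            std::string word = src.substr(start, i - start);
            for (const std::string& k : keywords) {
                if (word == k) {
                    found.push_back(word);
                    break;
                }
            }
        } else {
            ++i;  // anything else is ignored, like the catch-all "." rule
        }
    }
    return found;
}
```

Run on the three-line test case above, this sketch would report If, Then, Endif, While, Do, Endwhile and nothing from inside the string literal or the longer identifiers.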

The lex.l program is adapted from an example in the flex manual, which uses the same method to distinguish keywords from identifiers.

Also have a look at the Lex specification for the ANSI C grammar.

I have also used this approach in a personal project, and so far I haven't found any problems.
