Is backing up costly in flex?
This is an excerpt from the C file generated by flex (modified/reformatted slightly for readability).
yy_find_action:
    yy_act = yy_current_state[-1].yy_nxt;

    /* YY_DO_BEFORE_ACTION */
    yyg->yytext_ptr = yy_bp;
    yyg->yytext_ptr -= yyg->yy_more_len;
    yyleng = (int) (yy_cp - yyg->yytext_ptr);
    yyg->yy_hold_char = *yy_cp;
    *yy_cp = '\0';
    yyg->yy_c_buf_p = yy_cp;

    // this condition will always be false when yy_act = 0
    if ( yy_act != YY_END_OF_BUFFER && yy_rule_can_match_eol[yy_act] )
        {
        ...
        }

do_action:
    switch ( yy_act )
        {
        case 0: /* must back up */
            /* undo the effects of YY_DO_BEFORE_ACTION */
            *yy_cp = yyg->yy_hold_char;
            yy_cp = yyg->yy_last_accepting_cpos + 1;
            yy_current_state = yyg->yy_last_accepting_state;
            goto yy_find_action;
        ...
        }
In the above code, flex first fetches the action (yy_act). Then it does the setup for executing the action (YY_DO_BEFORE_ACTION). Then it checks whether the action is "back up" (case 0), and if so, it undoes all of that setup only to redo it after backing up.
According to the flex documentation on performance, backing up has the third-highest performance impact on the generated scanner. Wouldn't performance improve if the check for backup were done before the setup? Why does the generated code do it this way? I can't see any benefit.
1 Answer
Flex is written around the principle "optimise for the common case". That sometimes leads to deliberately pessimising uncommon cases, because the small per-token gain in the common case far outweighs the extra cost in a situation which rarely occurs.
Most lexers have one or two states which can require backup, but hitting them is actually pretty rare. For example, C requires backup in the case that the input contains ".." not followed by another ".", but that practically never occurs in a real program. The most common (or least uncommon) situation in which backup is necessary is when the buffer needs to be refilled in the middle of a token, something that will happen once every 16K of input; that might be every 1000 tokens or so. So which costs more: 1000 failed tests, or fewer than a dozen instructions to terminate a token and then undo that? Modern branch prediction probably reduces the cost of the failed tests considerably (much more than was typical when that code was written), but even so it doesn't seem like it would be a win. The alternative of including the token finalisation in every other action might be faster, but it significantly increases the size of the code, which also has an execution cost.
Also, saying that "backing up has the third-highest performance impact on the generated scanner" is a little misleading. The first two items on that list have a large performance impact; the impact of the remaining items is negligible.
The largest impact on scanners which might back up is that the lexer is forced to save the current state and cursor at every character which might be the end of a token; that cost is paid whether or not the current token eventually requires backup. So it has little to do with the code you pasted, which only runs once per token. In other words, the cost is not the backup itself, but rather the bookkeeping needed to make backing up possible.
All the same, you should try to avoid writing rules which might produce a large backup (including rules with unbounded trailing context), because that can result in excess scanning. In pathological cases it can even result in quadratic scanning time, although that's unlikely to show up in a practical grammar. (An example would be the pair of rules

    -*:

and

    -

which will take quadratic time to analyse an input consisting of a large number of - characters followed by something other than a colon.) A more common case with unbounded backup is a rule such as

    ["]([^"\\\n]|\\(.|\n))*["]

which matches a quoted string literal. Given an unterminated string literal, this rule causes a backup all the way to the initial ", which, aside from the possible inefficiency, really is not what you want. It is better to add another rule such as

    ["]([^"\\\n]|\\(.|\n))*\\?

which will match unterminated literals without backing up.