Implementing word-boundary states in flex/lex (parser generator)

Posted on 2024-07-12 04:28:38

I want to be able to predicate pattern matches on whether they occur after a word character or after a non-word character. In other words, I want to simulate the \b word-boundary assertion at the beginning of a pattern, which flex/lex does not support.

Here's my attempt below (which does not work as desired):

%{
#include <stdio.h>
%}

%x inword
%x nonword

%%
[a-zA-Z]    { BEGIN inword; yymore(); }
[^a-zA-Z]   { BEGIN nonword; yymore(); }

<inword>a { printf("'a' in word\n"); }
<nonword>a { printf("'a' not in word\n"); }

%%

Input:

a
ba
a

Expected output:

'a' not in word
'a' in word
'a' not in word

Actual output:

a
'a' in word
'a' in word

I'm doing this because I want to build something like the dialectizer, and I have always wanted to learn how to use a real lexer. Sometimes the patterns I want to replace are fragments of words; sometimes they must be whole words only.

Comments (2)

绅刃 2024-07-19 04:28:38

Here's what accomplished what I wanted:

%{
#include <stdio.h>
%}

WC      [A-Za-z']
NW      [^A-Za-z']

%start      INW NIW

%%

{WC}  { BEGIN INW; REJECT; }
{NW}  { BEGIN NIW; REJECT; }

<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }

This way I can do the equivalent of \B or \b at the beginning or end of any pattern. To match at the end, use trailing context: a/{WC} or a/{NW}.

I wanted to set up the states without consuming any characters. The trick is using REJECT rather than yymore(), which I guess I didn't fully understand.
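For readers who want to see the idea without flex, here is a minimal plain-C sketch of what the two start conditions track: a flag that records whether the previous character was a word character. It is an illustration, not the flex-generated scanner. The names is_word_char and classify_a are hypothetical, and the sketch assumes that the start of input counts as a non-word position, which the scanner above does not necessarily do.

```c
#include <ctype.h>
#include <stddef.h>

/* Word characters as in the WC definition above: letters and apostrophe. */
static int is_word_char(int c) {
    return isalpha((unsigned char)c) || c == '\'';
}

/*
 * Hypothetical helper: for every 'a' in text, store 1 in out if the
 * preceding character was a word character, otherwise 0.
 * Returns the number of flags written (at most max_out).
 */
size_t classify_a(const char *text, int *out, size_t max_out) {
    int in_word = 0;   /* plays the role of INW (1) versus NIW (0) */
    size_t n = 0;

    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == 'a' && n < max_out) {
            out[n++] = in_word;   /* the decision uses the state left by the previous char */
        }
        /* Update the state after looking at the character, like BEGIN before REJECT. */
        in_word = is_word_char((unsigned char)*p);
    }
    return n;
}
```

Calling classify_a on the question's input "a\nba\na\n" yields the flags 0, 1, 0, which corresponds to "not in word", "in word", "not in word".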

七堇年 2024-07-19 04:28:38
%%
[a-zA-Z]+a[a-zA-Z]* {printf("a in word: %s\n", yytext);}
a[a-zA-Z]+ {printf("a in word: %s\n", yytext);}
a {printf("a not in word\n");}
. ;

Testing:

user@cody /tmp $ ./a.out <<EOF
> a
> ba
> ab
> a
> EOF
a not in word

a in word: ba

a in word: ab

a not in word