Big problem with a regular expression in Lex (lexical analyzer)

Posted on 2024-08-26 17:55:49

I have some content like this:

    author = "Marjan Mernik  and Viljem Zumer",
    title = "Implementation of multiple attribute grammar inheritance in the tool LISA",
    year = 1999

    author = "Manfred Broy and Martin Wirsing",
    title = "Generalized
             Heterogeneous Algebras and
             Partial Interpretations",
    year = 1983

    author = "Ikuo Nakata and Masataka Sassa",
    title = "L-Attributed LL(1)-Grammars are
             LR-Attributed",
    journal = "Information Processing Letters"

And I need to catch everything between double quotes for title. My first try was this:

^(" "|\t)+"title"" "*=" "*"\"".+"\","

This catches the first example, but not the other two. The others span multiple lines, and that's the problem. I thought about adding \n somewhere to allow multiple lines, like this:

^(" "|\t)+"title"" "*=" "*"\""(.|\n)+"\","

But this doesn't help; instead, it catches everything.

Then I thought: what I want is between double quotes, so what if I catch everything until I find another " followed by ,? That way I would know whether I was at the end of the title, no matter the number of lines, like this:

^(" "|\t)+"title"" "*=" "*"\""[^"\""]+","

But this has another problem. The examples above don't show it, but a double quote (") can appear inside the title. For instance:

title = "aaaaaaa \"X bbbbbb",

And yes, it will always be preceded by a backslash (\).

Any suggestions to fix this regexp?

骑趴 2024-09-02 17:55:49

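The classical regex to match strings in double quotes is:

\"([^\"\\]|\\.)*\"

The character class excludes the backslash so that an escape such as \" is always consumed as one unit by \\. rather than being split. In lex a negated character class also matches newlines, so titles that span several lines are covered.

In your case, you'll want something like this:

"title"\ *=\ *\"([^\"\\]|\\.)*\"

PS: IMHO, you're putting too many quotes in your regexes, which makes them hard to read.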
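
To make the matching rule concrete, here is a rough hand-written C sketch of what a scanner for \"([^\"\\]|\\.)*\" does. It is only an illustration, not what lex generates, and the function name quoted_len is made up for this example. It returns the length of the quoted string at the start of the input, or 0 if there is none.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the quoted string that starts at s,
 * counting both quotes, or 0 if s does not start with a complete one.
 * It follows \"([^\"\\]|\\.)*\" : a backslash escapes the next character,
 * but (as with '.' in lex) not a newline. */
size_t quoted_len(const char *s)
{
    assert(s != NULL);
    if (s[0] != '"')
        return 0;                    /* must open with a double quote */

    size_t i = 1;
    while (s[i] != '\0') {
        if (s[i] == '\\') {
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;            /* backslash with nothing to escape */
            i += 2;                  /* consume the escape as one unit */
        } else if (s[i] == '"') {
            return i + 1;            /* first unescaped quote closes the string */
        } else {
            i++;                     /* ordinary character, newlines included */
        }
    }
    return 0;                        /* input ended before the closing quote */
}
```

Because the backslash is handled before the quote test, an input such as "a\\" , "b" stops after the first five characters instead of running on to a later quote.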

甜点 2024-09-02 17:55:49

You could use start conditions to simplify each separate pattern, for example:

%x title
%%
"title"\ *=\ *\"  { /* mark title start and switch to the title state */
  BEGIN(title);
  fputs("found title = <|", yyout);
}

<title>[^"\\]+ { /* plain title text; use ([^"\\]|\\.)* to grab it all at once */
  ECHO;
}

<title>\\. { /* process escapes inside title */
  char c = *(yytext + 1);
  fputc(c, yyout); /* emit the escaped character twice */
  fputc(c, yyout);
}

<title>\" { /* mark end of title */
  fputs("|>", yyout);
  BEGIN(INITIAL); /* continue as usual */
}

To make an executable:

$ flex parse_ini.y
$ gcc -o parse_ini lex.yy.c -lfl

Run it:

$ ./parse_ini < input.txt

Where input.txt is:

author = "Marjan\" Mernik  and Viljem Zumer",
title = "Imp\"lementation of multiple...",
year = 1999

Output:

author = "Marjan\" Mernik  and Viljem Zumer",
found title = <|Imp""lementation of multiple...|>,
year = 1999

It replaced the '"' around the title with '<|' and '|>'. Also, each '\"' inside the title is replaced by '""'.
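
If it helps to see the start-condition idea without lex, the same two-state logic can be sketched in plain C. This is an illustration under assumptions, not generated scanner code: the names rewrite_titles, title_prefix_len and append_str are invented here, and the output goes to a caller-supplied buffer instead of yyout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of a leading  title<spaces>=<spaces>"  prefix, or 0 if absent. */
static size_t title_prefix_len(const char *s)
{
    if (strncmp(s, "title", 5) != 0) return 0;
    size_t i = 5;
    while (s[i] == ' ') i++;
    if (s[i] != '=') return 0;
    i++;
    while (s[i] == ' ') i++;
    return s[i] == '"' ? i + 1 : 0;
}

/* Append src to out if it fits; returns 1 on success, 0 if out is full. */
static int append_str(char *out, size_t cap, size_t *len, const char *src)
{
    size_t n = strlen(src);
    if (*len + n >= cap) return 0;
    memcpy(out + *len, src, n + 1);   /* copies the terminating NUL too */
    *len += n;
    return 1;
}

/* Copies in to out, marking titles with <| |> and doubling escaped
 * characters inside them. Returns 0 on success, -1 if out is too small. */
int rewrite_titles(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL && cap > 0);
    enum { INITIAL, TITLE } state = INITIAL;   /* plays the role of %x title */
    size_t len = 0, i = 0, n;
    out[0] = '\0';

    while (in[i] != '\0') {
        char buf[3] = { 0, 0, 0 };
        if (state == INITIAL && (n = title_prefix_len(in + i)) > 0) {
            if (!append_str(out, cap, &len, "found title = <|")) return -1;
            state = TITLE;                     /* like BEGIN(title) */
            i += n;
        } else if (state == TITLE && in[i] == '\\' &&
                   in[i + 1] != '\0' && in[i + 1] != '\n') {
            buf[0] = buf[1] = in[i + 1];       /* escaped character, twice */
            if (!append_str(out, cap, &len, buf)) return -1;
            i += 2;
        } else if (state == TITLE && in[i] == '"') {
            if (!append_str(out, cap, &len, "|>")) return -1;
            state = INITIAL;                   /* like BEGIN(INITIAL) */
            i++;
        } else {
            buf[0] = in[i];                    /* everything else is echoed */
            if (!append_str(out, cap, &len, buf)) return -1;
            i++;
        }
    }
    return 0;
}
```

Each branch corresponds to one rule of the lex file, and the final else branch stands in for the default rule that echoes unmatched input. Keeping the state explicit is what lets each individual test stay trivial, which is the same benefit the start condition gives the patterns.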
