Big problem with regular expressions in Lex (lexical analyzer)

Posted on 2024-08-26 17:55:49

I have some content like this:

    author = "Marjan Mernik  and Viljem Zumer",
    title = "Implementation of multiple attribute grammar inheritance in the tool LISA",
    year = 1999

    author = "Manfred Broy and Martin Wirsing",
    title = "Generalized
             Heterogeneous Algebras and
             Partial Interpretations",
    year = 1983

    author = "Ikuo Nakata and Masataka Sassa",
    title = "L-Attributed LL(1)-Grammars are
             LR-Attributed",
    journal = "Information Processing Letters"

I need to capture everything between the double quotes of title. My first try was this:

^(" "|\t)+"title"" "*=" "*"\"".+"\","

This catches the first example, but not the other two. Those titles span multiple lines, and that's the problem. I thought about adding \n somewhere to allow multiple lines, like this:

^(" "|\t)+"title"" "*=" "*"\""(.|\n)+"\","

But this doesn't help; instead, it catches everything.

Then I thought: what I want is between double quotes, so what if I catch everything until I find another " followed by ,? That way I would know whether I was at the end of the title, no matter the number of lines, like this:

^(" "|\t)+"title"" "*=" "*"\""[^"\""]+","

But this has another problem. The examples above don't show it, but the double quote symbol (") can appear inside the title value. For instance:

title = "aaaaaaa \"X bbbbbb",

And yes, it will always be preceded by a backslash (\).

Any suggestions for fixing this regexp?

Comments (2)

骑趴 2024-09-02 17:55:49

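The classical regex to match strings in double quotes is:

\"([^\"\\]|\\.)*\"

The character class excludes the backslash as well as the quote, so a backslash can only be consumed together with the character it escapes. In your case, you'll want something like this:

"title"\ *=\ *\"([^\"\\]|\\.)*\"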

PS: IMHO, you're putting too many quotes in your regexes, it's hard to read.

甜点 2024-09-02 17:55:49

You could use start conditions to simplify each separate pattern, for example:

%x title
%%
"title"\ *=\ *\"  { /* mark title start */
  BEGIN(title);
  fputs("found title = <|", yyout);
}

<title>[^"\\]+ { /* process title part, use ([^"\\]|\\.)* to grab all at once */
  ECHO;
}

<title>\\. { /* process escapes inside title */
  char c = *(yytext + 1);
  fputc(c, yyout); /* double escaped characters */
  fputc(c, yyout);
}

<title>\" { /* mark end of title */
  fputs("|>", yyout);
  BEGIN(INITIAL); /* back to the default start condition */
}

To make an executable:

$ flex parse_ini.y
$ gcc -o parse_ini lex.yy.c -lfl

Run it:

$ ./parse_ini < input.txt 

Where input.txt is:

author = "Marjan\" Mernik  and Viljem Zumer",
title = "Imp\"lementation of multiple...",
year = 1999

Output:

author = "Marjan\" Mernik  and Viljem Zumer",
found title = <|Imp""lementation of multiple...|>,
year = 1999

It replaced the '"' around the title with '<|' and '|>'. Also, '\"' is replaced by '""' inside the title.
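
For readers who want to see what the start condition is doing, here is a rough hand-written C port of the same state machine. It is only a sketch under a few assumptions: the whole input is already in memory as a NUL-terminated string, the output goes to a caller-supplied buffer instead of `yyout`, and the names `OutBuf`, `put_ch`, `put_str`, `title_prefix_len` and `mark_titles` are made up for this illustration. The `in_title` flag plays the role of the `title` start condition.

```c
#include <stddef.h>
#include <string.h>

/* Bounded output buffer (illustrative; truncates silently when full). */
typedef struct {
    char  *buf;
    size_t len;
    size_t cap;   /* must be >= 1 so there is room for the NUL */
} OutBuf;

static void put_ch(OutBuf *o, char c)
{
    if (o->len + 1 < o->cap)
        o->buf[o->len++] = c;
    o->buf[o->len] = '\0';
}

static void put_str(OutBuf *o, const char *s)
{
    while (*s != '\0')
        put_ch(o, *s++);
}

/* Length of a leading  title *= *"  at s, or 0 if s does not start with one. */
static size_t title_prefix_len(const char *s)
{
    size_t i = 5;

    if (strncmp(s, "title", 5) != 0)
        return 0;
    while (s[i] == ' ')
        i++;
    if (s[i] != '=')
        return 0;
    i++;
    while (s[i] == ' ')
        i++;
    return s[i] == '"' ? i + 1 : 0;
}

/* Copy in to out, marking titles the way the scanner above does. */
void mark_titles(const char *in, char *out, size_t cap)
{
    OutBuf o = { out, 0, cap };
    int in_title = 0;             /* stands in for the <title> start condition */
    size_t n;

    if (cap == 0)
        return;
    out[0] = '\0';

    while (*in != '\0') {
        if (!in_title && (n = title_prefix_len(in)) > 0) {
            put_str(&o, "found title = <|");   /* title start */
            in += n;
            in_title = 1;
        } else if (in_title && in[0] == '\\' &&
                   in[1] != '\0' && in[1] != '\n') {
            put_ch(&o, in[1]);                 /* double the escaped character */
            put_ch(&o, in[1]);
            in += 2;
        } else if (in_title && in[0] == '"') {
            put_str(&o, "|>");                 /* title end */
            in++;
            in_title = 0;
        } else {
            put_ch(&o, *in++);                 /* everything else is echoed */
        }
    }
}
```

Calling `mark_titles(text, out, sizeof out)` on the contents of input.txt should give the same text as the scanner's output shown above. The lex version is still the better fit when more fields need handling, since each extra rule is one more pattern rather than another branch in the loop.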
