Big problem with a regular expression in Lex (lexical analyzer)
I have some content like this:
author = "Marjan Mernik and Viljem Zumer",
title = "Implementation of multiple attribute grammar inheritance in the tool LISA",
year = 1999
author = "Manfred Broy and Martin Wirsing",
title = "Generalized
Heterogeneous Algebras and
Partial Interpretations",
year = 1983
author = "Ikuo Nakata and Masataka Sassa",
title = "L-Attributed LL(1)-Grammars are
LR-Attributed",
journal = "Information Processing Letters"
I need to capture everything between the double quotes of each title. My first try was this:
^(" "|\t)+"title"" "*=" "*"\"".+"\","
This catches the first example, but not the other two. The others span multiple lines, and that's the problem. I thought about adding \n somewhere to allow multiple lines, like this:
^(" "|\t)+"title"" "*=" "*"\""(.|\n)+"\","
But this doesn't help; instead, it catches everything.
Then I thought: what I want is between double quotes, so what if I catch everything until I find another " followed by ,? That way I would know whether I had reached the end of the title, no matter how many lines it spans, like this:
^(" "|\t)+"title"" "*=" "*"\""[^"\""]+","
But this has another problem. The examples above don't show it, but the double quote symbol (") can appear inside the title. For instance:
title = "aaaaaaa \"X bbbbbb",
And yes, it will always be preceded by a backslash (\).
Any suggestions to fix this regexp?
The classical regex to match strings in double quotes is:
In your case, you'll want something like this:
PS: IMHO, you're putting too many quotes in your regexes, which makes them hard to read.
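A minimal sketch of how such a pattern might look in a complete flex file. The definition names WS and STRING, the "TITLE:" output prefix, and the rule itself are assumptions made for illustration, not the answerer's original code. The idea is that a string body is a sequence of either ordinary characters (anything except " and \, newlines included) or a backslash followed by any character, so an escaped \" never ends the match and multi-line titles are covered.

```lex
%option noyywrap nounput noinput
%{
#include <stdio.h>
#include <string.h>
%}

/* Optional blanks, and a double-quoted string with backslash escapes. */
WS      [ \t]*
STRING  \"([^\"\\]|\\(.|\n))*\"

%%

^{WS}title{WS}={WS}{STRING}   {
        /* yytext holds the whole match; print only what lies between
           the first quote and the closing quote (the last character). */
        const char *first = strchr(yytext, '"');
        const char *last  = yytext + yyleng - 1;
        printf("TITLE: %.*s\n", (int)(last - first - 1), first + 1);
    }

.|\n    { /* ignore everything that is not a title */ }

%%

int main(void)
{
    yylex();
    return 0;
}
```

Because the negated character class also matches newlines, no separate (.|\n) alternative is needed for the ordinary characters, and the match cannot run past the first unescaped closing quote.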
You could use start conditions to simplify each separate pattern, for example:
To make an executable:
Run it:
Where input.txt is:
Output:
It replaced '"' around the title by '<|' and '|>'. Also '\"' is replaced by '""' inside the title.
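A minimal sketch of the start-condition approach described above, assuming flex. The condition name TITLE and the exact rules are illustrative assumptions, not the answerer's original code. One rule recognizes the opening of a title and switches state; inside that state each small pattern handles a single case (escaped quote, other escape, closing quote, ordinary text), and everything outside a title is copied through by flex's default rule.

```lex
%option noyywrap nounput noinput
%{
#include <stdio.h>
%}

/* Exclusive start condition: only <TITLE> rules are active inside a title. */
%x TITLE

%%

^[ \t]*title[ \t]*=[ \t]*\"   {
        /* Copy the match except its final quote, then emit the opening marker. */
        fwrite(yytext, 1, (size_t)yyleng - 1, yyout);
        fputs("<|", yyout);
        BEGIN(TITLE);
    }

<TITLE>\\\"        { /* escaped quote becomes a doubled quote */ fputs("\"\"", yyout); }
<TITLE>\\(.|\n)    { /* any other escape is copied unchanged */ ECHO; }
<TITLE>\"          { /* unescaped quote ends the title */ fputs("|>", yyout); BEGIN(INITIAL); }
<TITLE>[^\"\\]+    { /* ordinary text, newlines included */ ECHO; }

%%

int main(void)
{
    yylex();
    return 0;
}
```

Assuming the file is saved as title.l (a hypothetical name), it could be built with "flex title.l" followed by "cc lex.yy.c -o title", and run as "./title < input.txt". With these rules, the line title = "aaaaaaa \"X bbbbbb", comes out as title = <|aaaaaaa ""X bbbbbb|>,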