Pyparsing a CSV string with random quotes
I have a string like the following:
<118>date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from="[email protected]",mailer="mta",client_name="example.org,[194.177.17.24]",resolved=OK,to="[email protected]",direction="in",message_length=6832079,virus="",disposition="Accept",classifier="Not,Spam",subject="=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="
I tried the csv module, but it didn't fit, because I couldn't find a way to make it ignore commas inside quoted values.
Pyparsing looked like a better answer, but I haven't found a way to declare the whole grammar.
Currently I am using my old Perl script to parse it, but I want this written in Python.
If you need my Perl snippet, I will be glad to provide it.
Any help is appreciated.
Comments (2)
It might be better to leverage an existing parser than to use ad-hoc regexes.
Example:
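A minimal sketch of that idea, assuming Python's standard-library shlex module as the existing parser. The choice of shlex and the helper name parse_log_line are illustrative assumptions, not taken from the answer. The lexer is told to treat only commas as separators, so a comma inside a double-quoted value stays part of that value.

```python
import re
import shlex


def parse_log_line(line):
    """Split a key=value,key="quoted,value" log line into a dict (hypothetical helper)."""
    # Drop a leading syslog-style priority tag such as "<118>", if present.
    line = re.sub(r'^<\d+>', '', line)

    lexer = shlex.shlex(line, posix=True)  # posix mode strips the enclosing quotes
    lexer.whitespace = ','                 # fields are separated by commas only
    lexer.whitespace_split = True          # split on separators, not on punctuation
    lexer.quotes = '"'                     # only double quotes delimit values
    lexer.commenters = ''                  # do not treat '#' as a comment marker

    fields = {}
    for token in lexer:
        # Split at the first '=' only; values may contain '=' themselves.
        key, _, value = token.partition('=')
        fields[key] = value
    return fields


sample = ('<118>date=2010-05-09,time=16:41:27,'
          'client_name="example.org,[194.177.17.24]",'
          'virus="",classifier="Not,Spam"')
print(parse_log_line(sample))
```

Because the token is split at the first = only, a value such as the MIME-encoded subject, which itself starts and ends with =, is kept whole.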
I'm not sure what you're really looking for, but this regex matches each key=value pair, capturing the key in group 1 and the value in group 2:
(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)
You might want to clean up the quoted strings afterwards (using mystring.strip("'\"")).
EDIT: This regex now also correctly handles escaped quotes inside quoted strings (a="She said \"Hi!\"").
Explanation of the regex:
(\w+) : Match the identifier and capture it into backreference no. 1.
= : Match a =.
( : Capture the following into backreference no. 2:
(?: : One of the following:
"(?:\\.|[^\\"])*" : A double quote, followed by zero or more of the following: an escaped character or a non-quote/non-backslash character, followed by another double quote.
| : or
'(?:\\.|[^\\'])*' : See above, just for single quotes.
| : or
[^\\,"'] : One character that is neither a backslash, a comma, nor a quote.
)+ : Repeat at least once, as many times as possible.
) : End of capturing group no. 2.
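The pattern explained above can be exercised with re.findall. This is a sketch under assumptions: the regex is reassembled from the pieces listed in the explanation, and the names PAIR_RE, unquote and parse_pairs are illustrative. It removes exactly one pair of enclosing quotes instead of calling strip, since strip would remove every trailing quote character, including one that belongs to an escaped quote at the end of a value.

```python
import re

# key=value, where value is a double-quoted string, a single-quoted string,
# or a run of characters that are not backslash, comma, or quote.
PAIR_RE = re.compile(
    r'''(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)'''
)


def unquote(value):
    """Remove one pair of matching enclosing quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in '"\'':
        return value[1:-1]
    return value


def parse_pairs(text):
    """Return a dict of key -> unquoted value for every key=value pair in text."""
    return {key: unquote(value) for key, value in PAIR_RE.findall(text)}


sample = r'<118>date=2010-05-09,client_name="example.org,[194.177.17.24]",virus="",a="She said \"Hi!\""'
for key, value in parse_pairs(sample).items():
    print(key, '->', value)
```

Escaped quotes inside a value are left as written (backslash included); unescaping them is a separate step if it is needed.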