Pyparsing 带有随机引号的 CSV 字符串

发布于 2024-09-01 01:01:13 字数 958 浏览 6 评论 0原文

我有一个如下所示的字符串:

<118>date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from="[email protected]",mailer="mta",client_name="example.org,[194.177.17.24]",resolved=OK,to="[email protected]",direction="in",message_length=6832079,virus="",disposition="Accept",classifier="Not,Spam",subject="=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="

我尝试使用 CSV 模块,但它不适合,因为我还没有找到忽略引用内容的方法。 Pyparsing 看起来是一个更好的答案,但我还没有找到声明所有语法的方法。

目前,我正在使用旧的 Perl 脚本来解析它,但我希望用 Python 编写它。 如果您需要我的 Perl 代码片段,我将很乐意提供。

任何帮助表示赞赏。

I have a string like the following:

<118>date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from="[email protected]",mailer="mta",client_name="example.org,[194.177.17.24]",resolved=OK,to="[email protected]",direction="in",message_length=6832079,virus="",disposition="Accept",classifier="Not,Spam",subject="=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="

I tried using CSV module and it didn't fit, cause i haven't found a way to ignore what's quoted.
Pyparsing looked like a better answer but i haven't found a way to declare all the grammars.

Currently, i am using my old Perl script to parse it, but i want this written in Python.
if you need my Perl snippet i will be glad to provide it.

Any help is appreciated.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

溺孤伤于心 2024-09-08 01:01:14

利用现有的解析器可能比使用临时正则表达式更好。

parse_http_list(s)
    Parse lists as described by RFC 2068 Section 2.

    In particular, parse comma-separated lists where the elements of
    the list may include quoted-strings.  A quoted-string could
    contain a comma.  A non-quoted string could have quotes in the
    middle.  Neither commas nor quotes count if they are escaped.
    Only double-quotes count, not single-quotes.

parse_keqv_list(l)
    Parse list of key=value strings where keys are not duplicated.

例子:

>>> pprint.pprint(urllib2.parse_keqv_list(urllib2.parse_http_list(s)))
{'<118>date': '2010-05-09',
 'classifier': 'Not,Spam',
 'client_name': 'example.org,[194.177.17.24]',
 'device_id': 'FE-2KA3F09000049',
 'direction': 'in',
 'disposition': 'Accept',
 'from': '[email protected]',
 'log_id': '0400147717',
 'log_part': '00',
 'mailer': 'mta',
 'message_length': '6832079',
 'pri': 'information',
 'resolved': 'OK',
 'session_id': 'o49CedRc021772',
 'subject':'=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=',
 'subtype': 'n/a',
 'time': '16:41:27',
 'to': '[email protected]',
 'type': 'statistics',
 'virus': ''}

It might be better to leverage an existing parser than to use ad-hoc regexs.

parse_http_list(s)
    Parse lists as described by RFC 2068 Section 2.

    In particular, parse comma-separated lists where the elements of
    the list may include quoted-strings.  A quoted-string could
    contain a comma.  A non-quoted string could have quotes in the
    middle.  Neither commas nor quotes count if they are escaped.
    Only double-quotes count, not single-quotes.

parse_keqv_list(l)
    Parse list of key=value strings where keys are not duplicated.

Example:

>>> pprint.pprint(urllib2.parse_keqv_list(urllib2.parse_http_list(s)))
{'<118>date': '2010-05-09',
 'classifier': 'Not,Spam',
 'client_name': 'example.org,[194.177.17.24]',
 'device_id': 'FE-2KA3F09000049',
 'direction': 'in',
 'disposition': 'Accept',
 'from': '[email protected]',
 'log_id': '0400147717',
 'log_part': '00',
 'mailer': 'mta',
 'message_length': '6832079',
 'pri': 'information',
 'resolved': 'OK',
 'session_id': 'o49CedRc021772',
 'subject':'=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=',
 'subtype': 'n/a',
 'time': '16:41:27',
 'to': '[email protected]',
 'type': 'statistics',
 'virus': ''}
明月松间行 2024-09-08 01:01:14

我不确定您真正在寻找什么,但

import re
data = "date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from=\"[email protected]\",mailer=\"mta\",client_name=\"example.org,[194.177.17.24]\",resolved=OK,to=\"[email protected]\",direction=\"in\",message_length=6832079,virus=\"\",disposition=\"Accept\",classifier=\"Not,Spam\",subject=\"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=\""
pattern = r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
print(re.findall(pattern, data))

为您提供了

[('date', '2010-05-09'), ('time', '16:41:27'), ('device_id', 'FE-2KA3F09000049'),
 ('log_id', '0400147717'), ('log_part', '00'), ('type', 'statistics'),
 ('subtype', 'n/a'), ('pri', 'information'), ('session_id', 'o49CedRc021772'),
 ('from', '"[email protected]"'), ('mailer', '"mta"'),
 ('client_name', '"example.org,[194.177.17.24]"'), ('resolved', 'OK'),
 ('to', '"[email protected]"'), ('direction', '"in"'),
 ('message_length', '6832079'), ('virus', '""'), ('disposition', '"Accept"'),
 ('classifier', '"Not,Spam"'), 
 ('subject', '"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="')
]

您可能希望随后清理带引号的字符串(使用 mystring.strip("'\""))。

表达式现在还可以正确处理带引号的字符串内的转义引号 (a="She said \"Hi!\"")。

编辑:此正

(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)

则 ):匹配标识符并将其捕获到反向引用中 1

=:匹配 =

(:将以下内容捕获到反向引用中2:

(?::以下之一:

"(?:\\.|[^\\"])*":后跟双引号由零个或多个以下字符组成:转义字符或非引号/非反斜杠字符,后跟另一个双引号

|: 或

'(?:\\.|[ ^\\'])*':参见上文,仅适用于单引号

|:或

[^\\,"']:一个字符既不是反斜杠,也不是逗号,也不是引号:

至少重复一次,尽可能多次

):捕获组号的结尾。 2.

I'm not sure what you're really looking for, but

import re
data = "date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from=\"[email protected]\",mailer=\"mta\",client_name=\"example.org,[194.177.17.24]\",resolved=OK,to=\"[email protected]\",direction=\"in\",message_length=6832079,virus=\"\",disposition=\"Accept\",classifier=\"Not,Spam\",subject=\"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=\""
pattern = r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
print(re.findall(pattern, data))

gives you

[('date', '2010-05-09'), ('time', '16:41:27'), ('device_id', 'FE-2KA3F09000049'),
 ('log_id', '0400147717'), ('log_part', '00'), ('type', 'statistics'),
 ('subtype', 'n/a'), ('pri', 'information'), ('session_id', 'o49CedRc021772'),
 ('from', '"[email protected]"'), ('mailer', '"mta"'),
 ('client_name', '"example.org,[194.177.17.24]"'), ('resolved', 'OK'),
 ('to', '"[email protected]"'), ('direction', '"in"'),
 ('message_length', '6832079'), ('virus', '""'), ('disposition', '"Accept"'),
 ('classifier', '"Not,Spam"'), 
 ('subject', '"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="')
]

You might want to clean up the quoted strings afterwards (using mystring.strip("'\"")).

EDIT: This regex now also correctly handles escaped quotes inside quoted strings (a="She said \"Hi!\"").

Explanation of the regex:

(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)

(\w+): Match the identifier and capture it into backreference no. 1

=: Match a =

(: Capture the following into backreference no. 2:

(?:: One of the following:

"(?:\\.|[^\\"])*": A double quote, followed by either zero or more of the following: an escaped character or a non-quote/non-backslash character, followed by another double quote

|: or

'(?:\\.|[^\\'])*': See above, just for single quotes.

|: or

[^\\,"']: one character that is neither a backslash, a comma, nor a quote.

)+: repeat at least once, as many times as possible.

): end of capturing group no. 2.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文