pyparsing - parsing a simple line
I'm scratching my head over how to parse this line completely.
I'm having trouble with the '( 4801)' part; all the other elements are being grabbed OK.
# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00
This is what I have so far:
from pyparsing import nums, Word, Literal, White, Optional, Suppress, OneOrMore, Group, Combine, ParseException
unparsed_log_data = "# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00.007 Type: Periodic"
binary_name = "# MAIN_PROG"
pid = Literal("(" + nums + ")")
report_id = Combine(Suppress(binary_name) + pid)
year = Word(nums, max=4)
month = Word(nums, max=2)
day = Word(nums, max=2)
yearly_day = Combine(year + "-" + month + "-" + day)
clock24h = Combine(Word(nums, max=2) + ":" + Word(nums, max=2) + ":" + Word(nums, max=2) + Suppress("."))
timestamp = Combine(yearly_day + White(' ') + clock24h).setResultsName("timestamp")
time_bnf = report_id + Suppress("Generated at") + timestamp
time_bnf.searchString(unparsed_log_data)
EDIT:
Paul, if you have the patience,
how would I filter
unparsed_log_data = """
# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00
bla bla bla
multi line garbage
bla bla
Efficiency | 38 38 100 | 3497061 3497081 99 |
more garbage
"""
time_bnf = report_id + Suppress("Generated at") + timestamp
partial_report_ignore = Suppress(SkipTo("Efficiency"))
efficiency_bnf = Suppress("|") + integer.setResultsName("Efficiency") + Suppress(integer) + integer.setResultsName("EfficiencyPercent")
Both
efficiency_bnf.searchString(unparsed_log_data) and
report_bnf.searchString(unparsed_log_data)
return data as expected,
but if I try
report_and_effic = report_bnf + partial_report_ignore + efficiency_bnf
report_and_effic.searchString(unparsed_log_data)
it returns ([], {})
EDIT2:
the code should read:
partial_report_ignore = Suppress(SkipTo("Efficiency", include=True))
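Here is a minimal sketch of why include=True makes the difference. It assumes that integer is Word(nums), which the snippets above do not define, and it uses a simplified, hypothetical header expression in place of the full report expression. Without include=True, SkipTo stops just before "Efficiency", so the "|" that efficiency_bnf expects is not the next thing in the input.

```python
from pyparsing import SkipTo, Suppress, Word, nums

unparsed_log_data = """
# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00
bla bla bla
multi line garbage
bla bla
Efficiency | 38 38 100 | 3497061 3497081 99 |
more garbage
"""

# Assumption: `integer` is not defined in the question; a run of digits is used here.
integer = Word(nums)

# Hypothetical, simplified stand-in for the report header (not the original time_bnf).
header = (
    Suppress("# MAIN_PROG")
    + Suppress("(")
    + integer.setResultsName("pid")
    + Suppress(")")
)

efficiency_bnf = (
    Suppress("|")
    + integer.setResultsName("Efficiency")
    + Suppress(integer)
    + integer.setResultsName("EfficiencyPercent")
)

# SkipTo without include=True leaves the parser positioned at the word
# "Efficiency", so the following Suppress("|") cannot match.
stops_before = header + Suppress(SkipTo("Efficiency")) + efficiency_bnf

# With include=True the word "Efficiency" is consumed too, and "|" comes next.
skips_past = header + Suppress(SkipTo("Efficiency", include=True)) + efficiency_bnf

no_match = stops_before.searchString(unparsed_log_data)
match = skips_past.searchString(unparsed_log_data)

print(len(no_match))
print(match[0].asList())
```

An alternative with the same effect would be to keep the plain SkipTo and start efficiency_bnf with Suppress("Efficiency").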
Comments (1)
pid should be a left parenthesis, followed by a Word built from nums, followed by a right parenthesis, not a single Literal.
Pyparsing allows you to add strings to expression objects using '+'; a string added this way gets interpreted as a Literal of that string.
You wrote
Literal("(" + nums + ")")
nums is the string "0123456789", meant to be used when creating Words, as in Word(nums). So what you were trying to match was not "left-paren followed by a word composed of nums followed by right-paren"; you were trying to match the literal string "(0123456789)".
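Here is a minimal sketch of the difference, run against the sample line from the question. The names broken_pid, fixed_tokens and broken_matched are only for illustration. Combine is left out of report_id on purpose: by default it requires its pieces to be adjacent, and the sample has spaces before "(" and before the digits.

```python
from pyparsing import Literal, ParseException, Suppress, Word, nums

line = "# MAIN_PROG ( 4801) Generated at 2010-01-25 06:55:00.007 Type: Periodic"

# The expression from the question: a single Literal for the text "(0123456789)".
broken_pid = Literal("(" + nums + ")")

# The corrected expression: the plain strings "(" and ")" are promoted to
# Literal when they are added to the Word expression.
pid = "(" + Word(nums) + ")"

# No Combine here, so whitespace between the pieces is skipped as usual.
report_id = Suppress("# MAIN_PROG") + pid

fixed_tokens = report_id.parseString(line).asList()
print(fixed_tokens)  # ['(', '4801', ')']

# The broken version only matches the exact digit string, not the sample line.
try:
    (Suppress("# MAIN_PROG") + broken_pid).parseString(line)
    broken_matched = True
except ParseException:
    broken_matched = False
print(broken_matched)  # False
```

If only the number is wanted, the parentheses can be wrapped in Suppress, as in Suppress("(") + Word(nums) + Suppress(")").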