pyparsing: capturing groups of arbitrary text with given headers as nested lists
I have a text file that looks similar to this:
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
I am trying to construct a grammar with pyparsing that would result in the following list structure when asking for the parsed results as a list (i.e. the following should be printed when iterating through the parsed.asList() elements):
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala']]]
['some other header:',[['as before could be anything'],['hey isnt this fun']]]
The header names are all known beforehand, and individual headers may or may not appear. If they do appear, there is always at least one line of content.
The problem I am having is that I can't get the parser to recognise where 'section header 1:' ends and 'some other header:' begins. I end up with a parsed.asList() looking like:
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala'],['some other header'],['as before could be anything'],['hey isnt this fun']]]
(i.e. section header 1: gets seen correctly, but everything following it gets added to section header 1, including further header lines etc.)
I've tried various things, played with leaveWhitespace() and LineEnd() in various ways, but I can't figure it out.
The base parser I am hacking about with is (contrived example - in reality this is a class definition etc..).
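from pyparsing import Group, Literal, OneOrMore, Word, ZeroOrMore, printables

header_1_line = Literal('section header 1:')
# Any run of printable words counts as a text line (this is where it goes wrong)
text_line = Group(OneOrMore(Word(printables)))
header_1_block = Group(header_1_line + Group(OneOrMore(text_line)))
header_2_line = Literal('some other header:')
header_2_block = Group(header_2_line + Group(OneOrMore(text_line)))
overall_structure = ZeroOrMore(header_1_block | header_2_block)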
and is being called with
parsed=overall_structure.parseFile()
Cheers, Matt.
Matt -
Welcome to pyparsing! You have fallen into one of the most common pitfalls in working with pyparsing, and that is that people are smarter than computers. When you look at your input text, you can easily see which text can be headers and which text can't be. Unfortunately, pyparsing is not so intuitive, so you have to tell it explicitly what can and can't be text.
When you look at your sample text, you are not accepting just any line of text as possible text within a section. How do you know that 'some other header:' is not valid as text? Because you know that that string matches one of the known header strings. But in your current code, you have told pyparsing that any collection of Word(printables) is valid text, even if that collection is a valid section header.
To fix this, you have to add some explicit lookahead to your parser. Pyparsing offers two lookahead constructs, NotAny and FollowedBy. NotAny can be abbreviated using the '~' operator, so we can write this pseudocode expression for text:
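As a sketch of what that expression might look like, here is a small comparison of text with and without the lookahead. The header strings are the two from the question; the names any_header, plain_text and guarded_text are illustrative only.

```python
from pyparsing import Literal, OneOrMore, Word, printables

# Header strings taken from the question; variable names are illustrative.
any_header = Literal("section header 1:") | Literal("some other header:")

# Without lookahead: a run of printable words happily swallows a header line.
plain_text = OneOrMore(Word(printables))

# With negative lookahead: ~any_header is shorthand for NotAny(any_header),
# so a word is accepted only if no header starts at the current position.
guarded_text = OneOrMore(~any_header + Word(printables))

sample = "etc etc lala\nsome other header:\nas before"

plain_words = plain_text.parseString(sample).asList()
guarded_words = guarded_text.parseString(sample).asList()

print(plain_words)    # ['etc', 'etc', 'lala', 'some', 'other', 'header:', 'as', 'before']
print(guarded_words)  # ['etc', 'etc', 'lala']
```

The lookahead consumes no input; it only vetoes a match, which is why the guarded version stops cleanly just before the header.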
Here is a complete parser using negative lookahead to make sure you read each section, breaking on section headings:
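A minimal sketch of such a parser, assuming the two header strings and the sample text from the question. The helper make_header and the other names are illustrative, and the camelCase pyparsing names follow the ones used in the question (pyparsing 3 also provides snake_case equivalents).

```python
from pyparsing import (Group, LineEnd, Literal, OneOrMore, StringEnd,
                       Suppress, ZeroOrMore, restOfLine)

sample = """\
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""

def make_header(title):
    # A header is the exact title text followed by the end of its line.
    return Literal(title) + Suppress(LineEnd())

section_header = (make_header("section header 1:")
                  | make_header("some other header:"))

# A text line is the rest of the current line, but only if the line is not
# a header and we have not reached the end of the input.
text_line = Group(~section_header + ~StringEnd()
                  + restOfLine + Suppress(LineEnd()))

section_body = Group(OneOrMore(text_line))
parser = ZeroOrMore(Group(section_header + section_body))

parsed = parser.parseString(sample)
for section in parsed.asList():
    print(section)
# ['section header 1:', [['some words can be anything'], ['more words could be anything at all'], ['etc etc lala']]]
# ['some other header:', [['as before could be anything'], ['hey isnt this fun']]]
```

Consuming the line end explicitly after restOfLine is what moves the parser on to the next line; restOfLine itself stops just before the newline.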
In my first attempt, I forgot to also look for the end of string, so my restOfLine expression looped forever. By adding a second lookahead for the string end, my program terminates successfully. Exercise left for you: instead of enumerating all possible headers, define a header line as any line that ends with a ':'.
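One possible take on that exercise, as a sketch only: it assumes a header is any line whose last character is ':', which also means a content line ending in ':' would be treated as a header. The regular expression and the names below are illustrative.

```python
from pyparsing import (Group, LineEnd, OneOrMore, Regex, StringEnd,
                       Suppress, ZeroOrMore, restOfLine)

sample = """\
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""

# Assumption: any line ending in ':' is a header. (?m) makes '$' match at
# the end of each line rather than only at the end of the input.
generic_header = Regex(r"(?m).*:$") + Suppress(LineEnd())

# Same negative lookaheads as before: stop at a header or at end of input.
text_line = Group(~generic_header + ~StringEnd()
                  + restOfLine + Suppress(LineEnd()))

generic_parser = ZeroOrMore(
    Group(generic_header + Group(OneOrMore(text_line))))

generic_parsed = generic_parser.parseString(sample).asList()
for section in generic_parsed:
    print(section)
```

With this definition no header names need to be listed in advance, at the cost of the ambiguity noted above.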
Good luck with your pyparsing efforts,
-- Paul