Keyword matching in Pyparsing: non-greedy slurping of tokens
Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier; SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') or Keyword('SPEED_Y') or Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly), it parses correctly.
How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and then waiting for the _, which does not exist)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]
I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
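A minimal sketch of such a non-greedy grammar, built around SkipTo. The helper name UndParam and the Word(nums) definition of Value are assumptions made for illustration; they are not given in the question:

```python
from pyparsing import Literal, SkipTo, Suppress, Word, nums

# Literal rather than Keyword, so a parameter may directly follow '_'.
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')

# Hypothetical helper: the separating underscore (dropped from the
# results) followed by one of the parameters.
UndParam = Suppress('_') + Parameter

# Non-greedy identifier: everything up to the first "_<parameter>".
Identifier = SkipTo(UndParam)

# Assumption: a value is a plain run of digits.
Value = Word(nums)

Entry = Identifier + UndParam + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
```

SkipTo only stops where UndParam can match, so the underscore inside ABC_123 is left in the identifier.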
When we run this from the interactive interpreter, we can see the following:
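As a rough illustration of such a session, here is a self-contained script rather than a literal interpreter transcript. The grammar is restated compactly so the snippet runs on its own; UndParam and Word(nums) are assumed names and definitions:

```python
from pyparsing import Literal, SkipTo, Suppress, Word, nums

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter               # hypothetical helper name
Entry = SkipTo(UndParam) + UndParam + Word(nums)   # Word(nums) stands in for Value

first = Entry.parseString('ABC_123_SPEED_X 123').asList()
second = Entry.parseString('XYZ_9_SPEED_Z 42').asList()
# SkipTo accepts whatever precedes the parameter, punctuation included.
third = Entry.parseString('A$%!_SPEED_Y 7').asList()

print(first)   # ['ABC_123', 'SPEED_X', '123']
print(second)  # ['XYZ_9', 'SPEED_Z', '42']
print(third)   # ['A$%!', 'SPEED_Y', '7']
```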
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.

EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) + ZeroOrMore('_' + ~Parameter + Word(alphanums)))

Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums), we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that into the logic separately.

Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)

Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.

Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
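Assembled into a runnable sketch, the Combine-based Identifier might look as follows. Value as Word(nums), the Suppress around the separating underscore, and the helper parses() are assumptions added for illustration. The last two checks probe the Keyword versus Literal difference mentioned in this answer:

```python
from pyparsing import (Combine, Keyword, Literal, ParseException, Suppress,
                       Word, ZeroOrMore, alphanums, nums)

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')

# A word, then any number of "_word" pieces that do not start a Parameter,
# glued into one token with no whitespace allowed in between.
Identifier = Combine(Word(alphanums) + ZeroOrMore('_' + ~Parameter + Word(alphanums)))

Value = Word(nums)  # assumption: values are plain integers
Entry = Identifier + Suppress('_') + Parameter + Value

tokens = Entry.parseString('ABC_123_SPEED_X 123').asList()


def parses(expr, text):
    """Hypothetical helper: True if expr matches at the start of text."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


# Whitespace inside the identifier is rejected because of Combine.
spaced_ok = parses(Entry, 'ABC _123_SPEED_X 123')

# The same text after an underscore, once as Keyword and once as Literal.
keyword_ok = parses(Suppress('_') + Keyword('SPEED_X'), '_SPEED_X')
literal_ok = parses(Suppress('_') + Literal('SPEED_X'), '_SPEED_X')
```

As far as I can tell, Keyword refuses to match when the character immediately before it is one of its identifier characters, and the default set includes the underscore. That would explain why only the Literal version matches here.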
If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
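One possible way to do this, as a sketch only: it assumes the identifier is two alphanumeric runs joined by a single underscore (as in ABC_123), uses Literal for the parameters, and takes Value to be Word(nums). None of these choices are confirmed by the question.

```python
from pyparsing import Combine, Literal, Suppress, Word, alphanums, nums

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')

# Assumed shape: <alphanumerics>_<alphanumerics>. The identifier can only
# end in a letter or digit, so it stops before the second underscore.
Identifier = Combine(Word(alphanums) + '_' + Word(alphanums))

Value = Word(nums)  # assumption: values are plain integers
Entry = Identifier + Suppress('_') + Parameter + Value

tokens = Entry.parseString('ABC_123_SPEED_X 123').asList()
```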
If it's not the case but if the identifier length is fixed you can define Identifier like this:
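A sketch of the fixed-length variant, assuming every identifier is exactly 7 characters long (like ABC_123) and relying on the exact argument of Word. The length, the use of Literal, and Value as Word(nums) are illustrative assumptions:

```python
from pyparsing import Literal, Suppress, Word, alphanums, nums

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')

# Assumption: identifiers always have exactly 7 characters, so Word stops
# there even though more letters and underscores follow.
Identifier = Word(alphanums + '_', exact=7)

Value = Word(nums)  # assumption: values are plain integers
Entry = Identifier + Suppress('_') + Parameter + Value

tokens = Entry.parseString('ABC_123_SPEED_X 123').asList()
```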
You can also parse the identifier and parameter as one token, and split them in a parse action:
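A minimal sketch of this idea. Only the name split_ident_and_param() comes from this answer; its body, the name IdentAndParam, and Value as Word(nums) are assumptions for illustration:

```python
from pyparsing import Word, alphanums, nums


def split_ident_and_param(tokens):
    # Assumes the matched token looks like XXX_YYY_ZZZ_W: the first two
    # pieces form the identifier, the remaining ones the parameter.
    parts = tokens[0].split('_')
    return ['_'.join(parts[:2]), '_'.join(parts[2:])]


# Greedily match identifier and parameter together, then split them.
IdentAndParam = Word(alphanums + '_').setParseAction(split_ident_and_param)

Value = Word(nums)  # assumption: values are plain integers
Entry = IdentAndParam + Value

tokens = Entry.parseString('ABC_123_SPEED_X 123').asList()
```

Returning a list from the parse action replaces the single matched token with the two split pieces.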
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore). If this is not the case, you need to adjust the split_ident_and_param() method.
This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?"
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead by default. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.
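A small sketch of both points in this answer: building the Parameter alternation with reduce, and Word swallowing the parameter before any keyword gets a chance. The variable names are illustrative assumptions:

```python
from functools import reduce

from pyparsing import Keyword, Word, alphanums

names = ['SPEED_X', 'SPEED_Y', 'SPEED_Z']

# Fold the keywords together with |, giving one expression that tries
# each keyword in turn.
Parameter = reduce(lambda a, b: a | b, [Keyword(n) for n in names])
matched = Parameter.parseString('SPEED_Y 5')[0]

# By contrast, Python's `or` simply returns its first operand here, so
# an expression built with `or` only ever contains the first keyword.
kx, ky = Keyword('SPEED_X'), Keyword('SPEED_Y')
first_only = kx or ky

# Word is greedy: it takes every letter, digit and underscore it can,
# including the part that was meant to be the parameter.
Identifier = Word(alphanums + '_')
greedy = Identifier.parseString('ABC_123_SPEED_X 123')[0]
```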