Keyword matching in Pyparsing: non-greedy slurping of tokens

Posted 2024-08-14 12:28:35

Pythonistas:

Suppose you want to parse the following string using Pyparsing:

'ABC_123_SPEED_X 123'

where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:

from pyparsing import *

Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = Word( nums )  # placeholder; assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)

If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.

How can I make this parser a bit lazier, so that it waits until it matches the Keyword (as opposed to slurping the entire string as an Identifier and then waiting for a _ that does not exist)?

Thank you.

[Note: This is a complete rewrite of my question; I had not realized what the real problem was]

怪异←思 2024-08-21 12:28:35

I based my answer on this one, since what you're trying to do is get a non-greedy match. This seems difficult to achieve in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:

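from pyparsing import *

# Literal rather than Keyword: the parameter directly follows an underscore
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter   # separator underscore, dropped from the results
Identifier = SkipTo(UndParam)          # everything up to the separator + parameter
Value = Word(nums)
Entry = Identifier + UndParam + Value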

When we run this from the interactive interpreter, we can see the following:

>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})

Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
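To make the compromise concrete, here is a minimal sketch that reuses the grammar above. The input string with punctuation in it is made up for illustration; it is only meant to show that SkipTo takes whatever precedes the first "_" + Parameter.

```python
from pyparsing import Literal, SkipTo, Suppress, Word, nums

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value

# Hypothetical input: '$', '%' and '!' are not alphanumerics, yet SkipTo
# accepts them because it only looks for the first "_" + Parameter.
permissive = Entry.parseString('AB$%!_SPEED_X 123').asList()
print(permissive)
```

If the identifier has to be validated, that check would need to happen separately, for example in a parse action.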

EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:

Identifier = Combine(Word(alphanums) +
        ZeroOrMore('_' + ~Parameter + Word(alphanums)))

Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.

Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)

Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
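Putting the pieces together, a rough end-to-end sketch might look like the following. The Entry and Value definitions are borrowed from the first version of this answer, and the second and third test strings are invented for illustration.

```python
from pyparsing import (Combine, Literal, ParseException, Suppress, Word,
                       ZeroOrMore, alphanums, nums)

Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')

# An alphanumeric word, then any number of "_word" pieces, stopping as soon
# as the text after an underscore is a Parameter (negative lookahead via ~).
Identifier = Combine(Word(alphanums) +
        ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Value = Word(nums)
Entry = Identifier + Suppress('_') + Parameter + Value

good = Entry.parseString('ABC_123_SPEED_X 123').asList()
longer = Entry.parseString('A_B_C_SPEED_Y 7').asList()   # hypothetical input

# Combine requires the pieces to be contiguous, so a space inside the
# identifier leads to a parse error.
try:
    Entry.parseString('ABC _123_SPEED_X 123')
    spaced_ok = True
except ParseException:
    spaced_ok = False

print(good, longer, spaced_ok)
```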

Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
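A likely explanation, offered as my reading of pyparsing's behavior rather than something established above: Keyword only matches when the characters immediately before and after it are not identifier characters, and by default the underscore counts as one. Since the parameter here always follows an underscore, Keyword refuses to match. The sketch below illustrates this; passing a narrower set of identifier characters as the second argument is one possible workaround.

```python
from pyparsing import Keyword, Literal, ParseException, Suppress, alphanums

text = '_SPEED_X'

# Literal does not care what precedes it, so this matches.
literal_result = (Suppress('_') + Literal('SPEED_X')).parseString(text).asList()

# Keyword checks the preceding character; '_' is an identifier character
# by default, so the match is rejected.
try:
    (Suppress('_') + Keyword('SPEED_X')).parseString(text)
    keyword_matched = True
except ParseException:
    keyword_matched = False

# Assumption: restricting the identifier characters to alphanumerics (second
# positional argument) lets the keyword match right after an underscore.
relaxed = Keyword('SPEED_X', alphanums)
relaxed_result = (Suppress('_') + relaxed).parseString(text).asList()

print(literal_result, keyword_matched, relaxed_result)
```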

抚笙 2024-08-21 12:28:35

If you are sure that the identifier never ends with an underscore, you can enforce it in the definition (as written, it matches two alphanumeric runs joined by a single underscore):

from pyparsing import *

my_string = 'ABC_123_SPEED_X 123'

Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter  + Value
tokens = Entry.parseString(my_string)

print(tokens) # prints: ['ABC_123', 'SPEED_X', '123']

If that is not the case, but the identifier length is fixed, you can define Identifier like this:

Identifier = Word( alphanums + '_' , exact=7)
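As a quick sketch of the fixed-length variant, assuming every identifier is exactly seven characters long: the second input string is invented to show what happens when that assumption does not hold.

```python
from pyparsing import Literal, ParseException, Suppress, Word, alphanums, nums

Identifier = Word(alphanums + '_', exact=7)   # always exactly seven characters
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Suppress('_') + Parameter + Value

fixed = Entry.parseString('ABC_123_SPEED_X 123').asList()

# Hypothetical shorter identifier: the first seven characters are 'AB_12_S',
# so the separator is no longer where the grammar expects it.
try:
    Entry.parseString('AB_12_SPEED_X 9')
    short_ok = True
except ParseException:
    short_ok = False

print(fixed, short_ok)
```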

离笑几人歌 2024-08-21 12:28:35

You can also parse the identifier and parameter as one token, and split them in a parse action:

from pyparsing import *
import re

def split_ident_and_param(tokens):
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]

ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value

print(entry.parseString("ABC_123_SPEED_X 123"))

The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).

If this is not the case, you need to adjust the split_ident_and_param() method.
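One possible adjustment, sketched here under the assumption that the parameter is always one of SPEED_X, SPEED_Y or SPEED_Z as in the question: anchor the split on the known parameter names, so the identifier may contain any number of underscores. The function name split_on_known_param is made up for this example.

```python
import re

# Assumption: the parameter is one of SPEED_X, SPEED_Y, SPEED_Z; the
# identifier is everything before the final "_" + parameter.
IDENT_PARAM_RE = re.compile(r"^(?P<ident>.+)_(?P<param>SPEED_[XYZ])$")

def split_on_known_param(token):
    """Split 'IDENT_PARAM' into [ident, param], or raise ValueError."""
    mo = IDENT_PARAM_RE.match(token)
    if mo is None:
        raise ValueError("no known parameter suffix in %r" % token)
    return [mo.group("ident"), mo.group("param")]

print(split_on_known_param("ABC_123_SPEED_X"))
print(split_on_known_param("A_B_C_SPEED_Y"))
```

It could presumably be hooked into the grammar the same way as the original function, by calling it on tokens[0] inside the parse action.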

轻许诺言 2024-08-21 12:28:35

This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?":

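>>> from functools import reduce  # reduce is no longer a builtin in Python 3
>>> from pyparsing import Keyword
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])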
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
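For what it's worth, the same alternation can also be built without the lambda. The sketch below shows two equivalent-looking spellings, one with operator.or_ and one that hands the list to MatchFirst directly; the variable names are only for illustration.

```python
from functools import reduce
from operator import or_

from pyparsing import Keyword, MatchFirst, ParseException

keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']

# Same idea as the lambda version: fold the keywords together with "|".
p_reduce = reduce(or_, [Keyword(k) for k in keys])

# MatchFirst takes the list of alternatives directly.
p_direct = MatchFirst([Keyword(k) for k in keys])

matched = [p_reduce.parseString(k)[0] for k in keys]

# Keyword insists on a word boundary, so 'DEERS' is not accepted as 'DEER'.
try:
    p_direct.parseString('DEERS')
    partial_ok = True
except ParseException:
    partial_ok = False

print(matched, partial_ok)
```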

Edit:

This was a pretty good answer to the original question. I'll have to work on the new one.

Further edit:

I'm pretty sure you can't do what you're trying to do with a plain Word. The parser that pyparsing creates doesn't do implicit lookahead or backtracking. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.
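A tiny sketch of that greedy behavior, using the string from the question:

```python
from pyparsing import Word, alphanums

greedy = Word(alphanums + '_')

# The word stops only at the space, so the parameter is swallowed too.
consumed = greedy.parseString('ABC_123_SPEED_X 123')[0]
print(consumed)
```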
