PyParsing lookaheads and greedy expressions

Posted on 2025-01-04 20:08:15

I'm writing a parser for a query language using PyParsing, and I've gotten stuck on (what I believe to be) an issue with lookaheads. One clause type in the query is intended to split strings into 3 parts (fieldname, operator, value) such that fieldname is one word, operator is one or more words, and value is a word, a quoted string, or a parenthesized list of these.

My data look like

author is william
author is 'william shakespeare'
author is not shakespeare
author is in (william,'the bard',shakespeare)

And my current parser for this clause is written as:

from pyparsing import (Word, alphas, OneOrMore, QuotedString, Literal,
                       Group, delimitedList, originalTextFor)

fieldname = Word(alphas)

operator = OneOrMore(Word(alphas))

single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value

clause = fieldname + originalTextFor(operator) + value

Obviously this fails because the operator element is greedy and will gobble up the value if it can. From reading other similar questions and the docs, I've gathered that I need to manage that lookahead with a NotAny or FollowedBy, but I haven't been able to figure out how to make that work.


灼疼热情 2025-01-11 20:08:15

This is a good place to Be The Parser. Or more accurately, Make The Parser Think Like You Do. Ask yourself, "In 'author is shakespeare', how do I know that 'shakespeare' is not part of the operator?" You know that 'shakespeare' is the value because it is at the end of the query, there is nothing more after it. So operator words aren't just words of alphas, they are words of alphas that are not followed by the end of the string. Now build that lookahead logic into your definition of operator:

operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))

And I think this will start parsing better for you.
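To see the lookahead in context, here is a minimal sketch that drops the revised operator into the grammar from the question and runs it over the four sample queries. Everything except the operator line is copied from the question; the samples list and the parsed dict are names made up for this illustration. It uses the camelCase pyparsing names from the thread (parseString, delimitedList, originalTextFor), which newer pyparsing releases keep as compatibility aliases.

```python
from pyparsing import (FollowedBy, Group, Literal, OneOrMore, QuotedString,
                       StringEnd, Word, alphas, delimitedList, originalTextFor)

fieldname = Word(alphas)

# An operator word is an alphabetic word that is NOT immediately followed by
# the end of the string. The negative lookahead consumes no input.
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))

# The remaining definitions are unchanged from the question.
single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value

clause = fieldname + originalTextFor(operator) + value

# Illustrative inputs taken from the question.
samples = [
    "author is william",
    "author is 'william shakespeare'",
    "author is not shakespeare",
    "author is in (william,'the bard',shakespeare)",
]

# Parse each sample and keep the tokens as plain nested lists.
parsed = {s: clause.parseString(s).asList() for s in samples}
for s in samples:
    print(s, "->", parsed[s])
```

The last alphabetic word in "author is not shakespeare" is rejected as an operator word because the end of the string follows it, so it is left over for value to match. In the quoted-string and list cases the operator stops on its own, since the next character is not a letter. Note that `~expr` is pyparsing's shorthand for NotAny(expr), so this is the NotAny approach the question asked about.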

Some other tips:

I only use the '^' operator when there is some real ambiguity, for example when parsing a string of numbers that could be integers or hex. If I used Word(nums) | Word(hexnums), then I might misprocess "123ABC" as just the leading "123". Changing '|' to '^' tests all of the alternatives and chooses the longest match. In that decimal-or-hex example, I could have gotten the same result by reversing the alternatives and testing for Word(hexnums) first. In your query language, there is no way to confuse a quoted string with a non-quoted single-word value (one leads with ' or ", the other doesn't), so there is no reason to use '^'; '|' will suffice. The same goes for value = single_value ^ list_value.
Adding results names to the key components of your query string will make it easier to work with later:
clause = fieldname("fieldname") + originalTextFor(operator)("operator") + value("value")
Now you can access the parsed values by name instead of by parse position (which will get tricky and error-prone once you start getting more complicated with optional fields and such):
queryParts = clause.parseString('author is william')
print(queryParts.fieldname)
print(queryParts.operator)
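The '|' versus '^' tip can be checked with a few lines. This is only a small sketch of the decimal-or-hex example from the first tip; the variable names (text, first_match, longest_match, reordered) are made up for the illustration.

```python
from pyparsing import Word, hexnums, nums

text = "123ABC"

# '|' (MatchFirst) stops at the first alternative that matches.
first_match = (Word(nums) | Word(hexnums)).parseString(text).asList()

# '^' (Or) tries every alternative and keeps the longest match.
longest_match = (Word(nums) ^ Word(hexnums)).parseString(text).asList()

# Reordering the '|' alternatives so the broader pattern is tried first.
reordered = (Word(hexnums) | Word(nums)).parseString(text).asList()

print(first_match, longest_match, reordered)
```

With '|' the order of the alternatives decides the result, which is why it is enough whenever the alternatives cannot be confused with each other, as with a bare word versus a quoted string.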