Python: replacing regular expressions with BNF or pyparsing

Posted on 2024-09-19 03:10:24

I am parsing a relatively simple text, where each line describes a game unit. I have little knowledge of parsing techniques, so I used the following ad hoc solution:

import collections
import re


class ParserError(Exception):
    pass


class Unit:
    # rules is an ordered dictionary of tagged regex that is intended to be applied in the given order
    # the group named V would correspond to the value (if any) for that particular tag
    rules = (
        ('Level', r'Lv\. (?P<V>\d+)'),
        ('DPS', r'DPS: (?P<V>\d+)'),
        ('Type', r'(?P<V>Tank|Infantry|Artillery)'),
        #the XXX will be expanded into a list of valid traits
        #note: (XXX| )* wouldn't work; it will match the first space it finds,
        #and stop at that if it's in front of something other than a trait
        ('Traits', r'(?P<V>(XXX)(XXX| )*)'),
        # flavor text, if any, ends with a dot
        ('FlavorText', r'(?P<V>.*\."?$)'),
        )
    rules = collections.OrderedDict(rules)
    traits = '|'.join(('All-Terrain', 'Armored', 'Anti-Aircraft', 'Motorized'))
    rules['Traits'] = re.sub('XXX', traits, rules['Traits'])

    # rename the placeholder group V to the tag name, then compile each rule
    for x in rules:
        rules[x] = re.sub('<V>', '<'+x+'>', rules[x])
        rules[x] = re.compile(rules[x])

    def __init__(self, data):
        # data looks like this:
        # Lv. 5 Tank DPS: 55 Motorized Armored
        for field, regex in Unit.rules.items():
            data = regex.sub(self.parse, data, 1)
        # only whitespace should be left once every rule has consumed its part
        if data.strip():
            raise ParserError('Could not parse part of the input: ' + data)

    def parse(self, m):
        if len(m.groupdict()) != 1:
            raise Exception('Expected a single named group')
        field, value = m.groupdict().popitem()
        setattr(self, field, value)
        return ''

It works fine, but I feel I reached the limit of regex power. Specifically, in the case of Traits, the value ends up being a string that I need to split and convert into a list at a later point: e.g., obj.Traits would be set to 'Motorized Armored' in this code, but in a later function changed to ('Motorized', 'Armored').

I'm thinking of converting this code to use either EBNF or pyparsing grammar or something like that. My goals are:

make this code neater and less error-prone
avoid the ugly treatment of the case with a list of values (where I need to do a replacement inside the regex first, and later post-process the result to convert a string into a list)

What would be your suggestions about what to use, and how to rewrite the code?

P.S. I skipped some parts of the code to avoid clutter; if I introduced any errors in the process, sorry - the original code does work :)

阳光下慵懒的猫 2024-09-26 03:10:24

I started to write up a coaching guide for pyparsing, but looking at your rules, they translate pretty easily into pyparsing elements themselves, without dealing with EBNF, so I just cooked up a quick sample:

from pyparsing import Word, nums, oneOf, Group, OneOrMore, Regex, Optional

integer = Word(nums)
level = "Lv." + integer("Level")
dps = "DPS:" + integer("DPS")
type_ = oneOf("Tank Infantry Artillery")("Type")
traits = Group(OneOrMore(oneOf("All-Terrain Armored Anti-Aircraft Motorized")))("Traits")
flavortext = Regex(r".*\.$")("FlavorText")

rule = (Optional(level) & Optional(dps) & Optional(type_) & 
        Optional(traits) & Optional(flavortext))

I included the Regex example so you could see how a regular expression could be dropped in to an existing pyparsing grammar. The composition of rule using '&' operators means that the individual items could be found in any order (so the grammar takes care of the iterating over all the rules, instead of you doing it in your own code). Pyparsing uses operator overloading to build up complex parsers from simple ones: '+' for sequence, '|' and '^' for alternatives (first-match or longest-match), and so on.
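To make the operator descriptions above concrete, here is a minimal sketch of each one in isolation. The expression names (`first`, `longest`, `either_order`) and the sample strings are made up for illustration and are not part of the grammar above. It uses the same camelCase API as the rest of this answer; pyparsing 3 also offers snake_case spellings such as `parse_string`.

```python
from pyparsing import Literal, Word, nums

# '+' builds a sequence: the pieces must appear in this order.
level = Literal("Lv.") + Word(nums)
print(level.parseString("Lv. 5").asList())            # ['Lv.', '5']

# '|' is first-match: alternatives are tried left to right,
# so the shorter "Anti" wins because it is listed first.
first = Literal("Anti") | Literal("Anti-Aircraft")
print(first.parseString("Anti-Aircraft").asList())     # ['Anti']

# '^' is longest-match: every alternative is tried and the longest wins.
longest = Literal("Anti") ^ Literal("Anti-Aircraft")
print(longest.parseString("Anti-Aircraft").asList())   # ['Anti-Aircraft']

# '&' requires all of its pieces but accepts them in any order.
dps = Literal("DPS:") + Word(nums)
either_order = level & dps
print(either_order.parseString("DPS: 55 Lv. 5").asList())
```

The `|` versus `^` difference is the reason to order alternatives carefully (or use `oneOf`, which handles overlapping literals for you) when one keyword is a prefix of another.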

Here is how the parsed results would look - note that I added results names, just as you used named groups in your regexen:

data = "Lv. 5 Tank DPS: 55 Motorized Armored"

parsed_data = rule.parseString(data)
print(parsed_data.dump())
print(parsed_data.DPS)
print(parsed_data.Type)
print(' '.join(parsed_data.Traits))

prints:

['Lv.', '5', 'Tank', 'DPS:', '55', ['Motorized', 'Armored']]
- DPS: 55
- Level: 5
- Traits: ['Motorized', 'Armored']
- Type: Tank
55
Tank
Motorized Armored
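Results names can also be read dictionary-style, which helps when a field may be absent from a line. The sketch below rebuilds a cut-down version of the grammar (without DPS and flavor text) so that it runs on its own; the shortened input line and the "n/a" default are just illustrative choices.

```python
from pyparsing import Group, OneOrMore, Optional, Word, nums, oneOf

integer = Word(nums)
level = "Lv." + integer("Level")
type_ = oneOf("Tank Infantry Artillery")("Type")
traits = Group(OneOrMore(oneOf("All-Terrain Armored Anti-Aircraft Motorized")))("Traits")
rule = Optional(level) & Optional(type_) & Optional(traits)

parsed = rule.parseString("Lv. 5 Tank Motorized Armored")

print(parsed["Level"])            # dictionary-style access to a results name
print(parsed.get("DPS", "n/a"))   # default for a name that was never matched
print(list(parsed["Traits"]))     # the grouped traits as a plain list
print(parsed.asDict())            # all named results as a regular dict
```

Since Traits is wrapped in Group, it already arrives as a sequence of separate strings, so no later split step is needed.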

Please stop by the wiki and see the other examples. You can easy_install to install pyparsing, but if you download the source distribution from SourceForge, there is a lot of additional documentation.
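As a rough idea of how the grammar could slot back into the Unit class from the question, here is a minimal sketch. It is one possible arrangement, not the only way to do it. It assumes you want Level and DPS as integers (done with a parse action) and Traits as a tuple, leaves out the flavor text rule for brevity, and uses `parseAll=True` so that leftover input raises pyparsing's own ParseException in place of the hand-written "could not parse" check.

```python
from pyparsing import Group, OneOrMore, Optional, Word, nums, oneOf

# Parse action: convert the matched digit string to an int at parse time.
integer = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

level = "Lv." + integer("Level")
dps = "DPS:" + integer("DPS")
type_ = oneOf("Tank Infantry Artillery")("Type")
traits = Group(OneOrMore(oneOf("All-Terrain Armored Anti-Aircraft Motorized")))("Traits")

rule = Optional(level) & Optional(dps) & Optional(type_) & Optional(traits)


class Unit:
    def __init__(self, data):
        # parseAll=True raises ParseException if part of the line is left over
        parsed = rule.parseString(data, parseAll=True)
        # .get() returns None for fields that did not appear in the line
        self.Level = parsed.get("Level")
        self.DPS = parsed.get("DPS")
        self.Type = parsed.get("Type")
        self.Traits = tuple(parsed.get("Traits", ()))


unit = Unit("Lv. 5 Tank DPS: 55 Motorized Armored")
print(unit.Level, unit.Type, unit.DPS, unit.Traits)
# 5 Tank 55 ('Motorized', 'Armored')
```

The list of valid traits now lives in one `oneOf` string, so there is no placeholder substitution inside a regex and no post-processing of a joined string.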
