Tokenizing complex input
I'm attempting to tokenize the following input in Python:
text = 'This @example@ is "neither":/defn/neither complete[1] *nor* trite, *though _simple_*.'
I would like to produce something like the following while avoiding the use of regular expressions:
tokens = [
    ('text', 'This '),
    ('enter', 'code'),
    ('text', 'example'),
    ('exit', None),
    ('text', ' is '),
    ('enter', 'a'),
    ('text', 'neither'),
    ('href', '/defn/neither'),
    ('exit', None),
    ('text', ' complete'),
    ('enter', 'footnote'),
    ('id', 1),
    ('exit', None),
    ('text', ' '),
    ('enter', 'strong'),
    ('text', 'nor'),
    ('exit', None),
    ('text', ' trite, '),
    ('enter', 'strong'),
    ('text', 'though '),
    ('enter', 'em'),
    ('text', 'simple'),
    ('exit', None),
    ('exit', None),
    ('text', '.'),
]
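A flat enter/exit stream like this is straightforward to consume with a stack. As an illustration (not part of the question's own code), here is a minimal renderer that turns such a stream into HTML, assuming the tag names above map directly to elements; the `footnote` handling and the way `href` patches the open tag are purely illustrative:

```python
from html import escape

def render(tokens):
    out, stack = [], []
    for kind, value in tokens:
        if kind == 'text':
            out.append(escape(value))
        elif kind == 'enter':
            # remember where the opening tag landed so a later
            # ('href', ...) token can rewrite it in place
            stack.append((value, len(out)))
            out.append('<%s>' % value)
        elif kind == 'exit':
            tag, _ = stack.pop()
            out.append('</%s>' % tag)
        elif kind == 'href':
            # patch the innermost open tag with its link target
            tag, i = stack[-1]
            out[i] = '<a href="%s">' % escape(value, quote=True)
        elif kind == 'id':
            out.append(escape(str(value)))
    return ''.join(out)
```

Because `href` rewrites the recorded opening tag rather than the last item appended, formatting tokens inside the link text don't break it.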
Pretend the above is being produced by a generator. My current implementation works, though the code is somewhat hideous and not easily extended to support links.
Any assistance would be greatly appreciated.
Update: changed the desired syntax from a complex nested list structure to a simple stream of tuples; the indentation is just for us humans. Formatting within the text of a link is OK. Here is a simple parser that generates the lexing result I'm looking for, but it still doesn't handle links or footnotes.
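The simple parser referred to above isn't reproduced here. For the paired delimiters alone (`*strong*`, `_em_`, `@code@`), a regex-free, stack-based tokenizer could be sketched like this; the delimiter-to-tag mapping follows the token stream above, and links and footnotes are deliberately not handled:

```python
DELIMS = {'*': 'strong', '_': 'em', '@': 'code'}

def tokenize(text):
    open_tags = []  # stack of currently open tag names
    buf = []
    for ch in text:
        if ch in DELIMS:
            if buf:
                yield ('text', ''.join(buf))
                buf = []
            tag = DELIMS[ch]
            if open_tags and open_tags[-1] == tag:
                # same delimiter as the innermost open tag: close it
                open_tags.pop()
                yield ('exit', None)
            else:
                open_tags.append(tag)
                yield ('enter', tag)
        else:
            buf.append(ch)
    if buf:
        yield ('text', ''.join(buf))
```

This is a sketch, not a robust lexer: it treats every delimiter character as markup and assumes delimiters nest properly, but it produces the enter/text/exit shape shown above for input like `*though _simple_*.`.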
Well, here's a more complete parser with sufficient extensibility to do whatever I may need in the future. It only took three hours. It's not terribly speedy, but generally the output of the class of parser I'm writing is heavily cached anyway. Even with this tokenizer and parser in place, my full engine still clocks in at < 75% of the SLoC of the default python-textile renderer while remaining somewhat faster. All without regular expressions.
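The more complete parser itself isn't included in this post. As a hedged sketch of how the regex-free link lexing it describes might work: after a quoted phrase, a `:` introduces the target (as in `"neither":/defn/neither`), which can be consumed character by character up to whitespace, trimming one trailing punctuation mark so a sentence-final link doesn't swallow the period. The helper name below is illustrative:

```python
def read_href(text, i):
    """Consume URL characters starting at index i (just past the ':'
    that follows a closing quote) up to the next whitespace; strip a
    single trailing punctuation mark. Returns (url, next_index).
    Purely illustrative, not the author's elided implementation."""
    j = i
    while j < len(text) and not text[j].isspace():
        j += 1
    url = text[i:j]
    if url and url[-1] in '.,;:!?':
        url = url[:-1]
        j -= 1
    return url, j
```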
Footnote parsing remains to be done, but that's minor compared to link parsing. The output (as of this posting) is: