解析带有特殊标记的文本文件

发布于 2024-12-03 17:00:22 字数 1449 浏览 5 评论 0原文

我需要使用 Python 解析 DSL 文件。 DSL 文件是一个文本文件，其中的文本具有 ABBYY Lingvo 使用的特殊标记。

它看起来像：

activate
    [m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
    [m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
    [m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
    [m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
    {{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
    {{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}

现在我看到使用 regexps 解析此文件的唯一选项。但我怀疑它是否可以实现，因为该格式的标签具有一定的层次结构，其中一些标签位于其他标签内部。

我无法使用特殊的 xml 和 html 解析器。它们非常适合创建文档的树形结构，但它们是为 html 和 xml 的特殊标签而设计的。

解析这种格式的文件的最佳方法是什么？有没有用于此目的的 Python 库？

原文

I need to parse a DSL file using Python. A DSL file is a text file with a text having a special markup with tags used by ABBYY Lingvo.

It looks like:

activate
    [m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
    [m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
    [m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
    [m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
    {{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
    {{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}

Now I see the only option to parse this file using regexps. But I doubt if it can be achieved since tags in that format has some hierarchy, where some of them are inside others.

I can't use special xml and html parsers. They are perfect in creating a tree-structure of the document, but they are designed for special tags of html and xml.

What is the best way to parse a file in such a format? Is there any Python library for that purpose?

分享到QQ

分享到微博