解析带有特殊标记的文本文件
我需要使用 Python 解析 DSL 文件。 DSL 文件是一个文本文件,其中的文本具有 ABBYY Lingvo 使用的特殊标记。
它看起来像:
activate
[m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
[m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
[m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
[m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
{{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
{{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}
现在我看到使用 regexps 解析此文件的唯一选项。但我怀疑它是否可以实现,因为该格式的标签具有一定的层次结构,其中一些标签位于其他标签内部。
我无法使用特殊的 xml
和 html
解析器。它们非常适合创建文档的树形结构,但它们是为 html
和 xml
的特殊标签而设计的。
解析这种格式的文件的最佳方法是什么?有没有用于此目的的 Python 库?
I need to parse a DSL
file using Python. A DSL file is a text file with a text having a special markup with tags used by ABBYY Lingvo.
It looks like:
activate
[m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
[m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
[m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
[m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
{{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
{{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}
Now I see the only option to parse this file using regexps
. But I doubt if it can be achieved since tags in that format has some hierarchy, where some of them are inside others.
I can't use special xml
and html
parsers. They are perfect in creating a tree-structure of the document, but they are designed for special tags of html
and xml
.
What is the best way to parse a file in such a format? Is there any Python library for that purpose?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
查看 http://www.dabeaz.com/ply/
您也许可以定义语法快速、轻松地作为一组词汇规则和一些语法产生式。
如果您不喜欢那个,这里有一个替代品列表。
http://wiki.python.org/moin/LanguageParsing
Look at http://www.dabeaz.com/ply/
You may be able to define the syntax quickly and easily as a set of Lexical rules and some grammar productions.
If you don't like that one, here's a list of alternatives.
http://wiki.python.org/moin/LanguageParsing
将 RegExp 用于除琐碎用途之外的其他用途会带来心痛和痛苦。
如果您坚持使用正则表达式(不推荐),请查看使用的方法此处 关于 XML
如果“.dsl”指的是 ABBRY 或 Lingvo dict 格式,您可能需要查看 stardict。 它可以读取ABBRY dsl 格式。
Using RegExp for this for something other than trivial use will give heartache and pain.
If you insist on using a RegEx (NOT RECOMMENDED), look at the methods used HERE on XML
If by ".dsl" you mean the ABBRY or Lingvo dict format, you may want to look at stardict. It can read the ABBRY dsl format.