How do I parse complex file formats in Delphi? (Not CSV, XML, etc.)
It's been a few years since I've had to parse any files harder than CSV or XML, so I am out of practice. I've been given the task of parsing a file format called NeXus in a Delphi application.
The problem is that I just don't know where to start. Do I use a tokenizer, regex, etc.? Even a tutorial might be what I need at this point.
Comments (4)
Check this out. It's commercial, but it looks like a fun toy:
http://dpg.zenithlab.com/
But actually, for Nexus you do not need a complicated parser.
A bit of position-checking code, some string splitting and parenthesis counting, and you've got it written.
I would parse it using a simple token-at-a-time parser, and have an enumerated type for the line type.
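As a rough illustration of an enumerated line type, here is a minimal sketch. It assumes the phylogenetic NEXUS layout (a `#NEXUS` header, then `BEGIN name;` ... `END;` blocks). `TLineKind` and `ClassifyLine` are hypothetical names invented for this example, not part of any library, and the real set of line kinds you need may well be different.

```pascal
program LineKindDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical line categories; the names are illustrative only.
  TLineKind = (lkBlank, lkHeader, lkBlockBegin, lkBlockEnd, lkOther);

const
  KindNames: array[TLineKind] of string =
    ('blank', 'header', 'block begin', 'block end', 'other');

// Classify one line by looking only at how it starts (position checking).
function ClassifyLine(const Line: string): TLineKind;
var
  S: string;
begin
  S := UpperCase(Trim(Line));
  if S = '' then
    Result := lkBlank
  else if Pos('#NEXUS', S) = 1 then
    Result := lkHeader
  else if Pos('BEGIN ', S) = 1 then
    Result := lkBlockBegin
  else if (S = 'END;') or (S = 'ENDBLOCK;') then
    Result := lkBlockEnd
  else
    Result := lkOther;
end;

begin
  WriteLn(KindNames[ClassifyLine('#NEXUS')]);
  WriteLn(KindNames[ClassifyLine('begin trees;')]);
  WriteLn(KindNames[ClassifyLine('  tree t1 = ((a,b),c);')]);
  WriteLn(KindNames[ClassifyLine('end;')]);
end.
```

The main loop can then be a `case` over `TLineKind`, which keeps the per-line handling separate from the string inspection.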
For the above I would write myself a small set of helpers, and one of the things I might eventually need is a little token-splitting function like this:
function GetToken(var InputString: String; out OutputToken: String; const Separators: TStrings; Keywords: TStrings; ParenFlag: Boolean): Boolean;
GetToken would return True when it was able to find and return a token string from InputString. It would skip any leading whitespace and stop when it finds a separator. Separators are items like space or comma.
ParenFlag = True would mean that the next token I get should be an entire parenthesized list of items. Once I have the whole parenthesized list, for example ((a,b),(c,d),(e,f)), I would call another function that unpacks the content of that list into some data structure for the lists/arrays.
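One possible shape for such a function is sketched below. This is only an assumption about how it could behave, not the answerer's actual code: the `Keywords` parameter is left out to keep the example short, each entry in `Separators` is assumed to be a single character, and whitespace always counts as a separator. With `ParenFlag` set, a token that starts with `(` is extended by parenthesis counting until the matching `)`.

```pascal
program GetTokenDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Pulls the next token off the front of InputString and removes it from there.
// Returns False when only whitespace/separators are left.
function GetToken(var InputString: string; out OutputToken: string;
  const Separators: TStrings; ParenFlag: Boolean): Boolean;
var
  I, Start, Depth, Len: Integer;

  function IsSeparator(C: Char): Boolean;
  begin
    Result := (C <= ' ') or (Separators.IndexOf(C) >= 0);
  end;

begin
  OutputToken := '';
  Len := Length(InputString);
  I := 1;
  // Skip leading whitespace and separators.
  while (I <= Len) and IsSeparator(InputString[I]) do
    Inc(I);
  Result := I <= Len;
  if not Result then
  begin
    InputString := '';
    Exit;
  end;
  Start := I;
  if ParenFlag and (InputString[I] = '(') then
  begin
    // Count parentheses so the whole nested list comes back as one token.
    Depth := 0;
    repeat
      if InputString[I] = '(' then
        Inc(Depth)
      else if InputString[I] = ')' then
        Dec(Depth);
      Inc(I);
    until (Depth = 0) or (I > Len);
  end
  else
    // Ordinary token: run until the next separator.
    while (I <= Len) and not IsSeparator(InputString[I]) do
      Inc(I);
  OutputToken := Copy(InputString, Start, I - Start);
  Delete(InputString, 1, I - 1);
end;

var
  Seps: TStringList;
  Line, Tok: string;
begin
  Seps := TStringList.Create;
  try
    Seps.Add(',');
    Seps.Add(';');
    Line := 'TREE t1 = ((a,b),(c,d),(e,f));';
    // Prints one token per line; the parenthesized list stays in one piece.
    while GetToken(Line, Tok, Seps, True) do
      WriteLn(Tok);
  finally
    Seps.Free;
  end;
end.
```

Because the function consumes the input as it goes, the caller's loop stays trivial. The cost is repeated string deletion, which is fine for line-sized input but worth revisiting for very large files.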
I do not recommend a big parser engine. Writing a BNF grammar first, before you write the parser, will help you write the code, but there's nothing so brutal here that you can't parse it yourself.
Are you going to be expected to do queries/transforms on this? Do you think you need to convert it into JSON or XML in order to work further with it?
In addition to Mason's very nice answer: there is a great little class in Delphi that is often underappreciated, and one that you can learn a really nice technique from. That's the PageProducer class.
Have a look at the way it parses HTML and surfaces events on things like finding tags, attributes, etc. I'm not saying use the PageProducer (because you won't be able to for Nexus), but it's a very simple, elegant and powerful technique.
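The same event-surfacing idea can be sketched for a semicolon-terminated command format. Nothing below uses PageProducer itself. `TBlockScanner`, `TPrinter` and the event names are hypothetical, and the sketch assumes commands end in `;` and blocks open with `BEGIN name;`. The scanner only finds things; the handlers decide what they mean.

```pascal
program EventScannerDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Event signature: fired with the text of each item the scanner finds.
  TTextEvent = procedure(const Value: string) of object;

  // Hypothetical scanner that reports what it finds through events.
  TBlockScanner = class
  private
    FOnBlockBegin: TTextEvent;
    FOnCommand: TTextEvent;
  public
    procedure Scan(const Source: string);
    property OnBlockBegin: TTextEvent read FOnBlockBegin write FOnBlockBegin;
    property OnCommand: TTextEvent read FOnCommand write FOnCommand;
  end;

  // Hypothetical consumer; a real one would build data structures instead.
  TPrinter = class
  public
    procedure BlockBegin(const Value: string);
    procedure Command(const Value: string);
  end;

procedure TBlockScanner.Scan(const Source: string);
var
  Rest, Cmd: string;
  P: Integer;
begin
  Rest := Source;
  P := Pos(';', Rest);
  while P > 0 do
  begin
    // Everything up to the next semicolon is one command.
    Cmd := Trim(Copy(Rest, 1, P - 1));
    Delete(Rest, 1, P);
    if SameText(Copy(Cmd, 1, 6), 'BEGIN ') then
    begin
      if Assigned(FOnBlockBegin) then
        FOnBlockBegin(Trim(Copy(Cmd, 7, MaxInt)));
    end
    else if (Cmd <> '') and Assigned(FOnCommand) then
      FOnCommand(Cmd);  // includes END, which a handler can treat specially
    P := Pos(';', Rest);
  end;
end;

procedure TPrinter.BlockBegin(const Value: string);
begin
  WriteLn('block: ', Value);
end;

procedure TPrinter.Command(const Value: string);
begin
  WriteLn('  command: ', Value);
end;

var
  Scanner: TBlockScanner;
  Printer: TPrinter;
begin
  Scanner := TBlockScanner.Create;
  Printer := TPrinter.Create;
  try
    Scanner.OnBlockBegin := Printer.BlockBegin;
    Scanner.OnCommand := Printer.Command;
    Scanner.Scan('BEGIN TAXA; DIMENSIONS NTAX=3; TAXLABELS a b c; END;');
  finally
    Printer.Free;
    Scanner.Free;
  end;
end.
```

Splitting on `;` this naively would break on semicolons inside comments or quoted names, so a real scanner would need to track those as well.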
I haven't yet found a text format that a state machine won't parse. Add in recursion to run down the nesting in trees. State machines are an easily written, relatively quick parsing engine that can be built for virtually any patterned text file, and are often easier than using a scripting language to boot. I have custom ones written for HTML, XML, HL7 and a variety of medical EDI formats.
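A very small sketch of both ideas follows, under the assumption that comments are written in square brackets and nested lists use parentheses and commas. `StripComments` is a two-state machine that drops comments character by character, and `WalkList` recurses once per opening parenthesis. All names are hypothetical, and a real parser would build a tree rather than print.

```pascal
program StateMachineDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TScanState = (ssNormal, ssComment);

// State machine: copy characters through, dropping [bracketed comments].
function StripComments(const S: string): string;
var
  State: TScanState;
  I: Integer;
begin
  Result := '';
  State := ssNormal;
  for I := 1 to Length(S) do
    case State of
      ssNormal:
        if S[I] = '[' then
          State := ssComment
        else
          Result := Result + S[I];
      ssComment:
        if S[I] = ']' then
          State := ssNormal;
    end;
end;

// Recursion: walk a nested list such as ((a,b),c) starting at S[Idx].
// Prints each leaf name indented by two spaces per nesting level.
procedure WalkList(const S: string; var Idx: Integer; Depth: Integer);
var
  Leaf: string;
  C: Char;

  procedure FlushLeaf;
  begin
    if Leaf <> '' then
      WriteLn(StringOfChar(' ', 2 * Depth), Leaf);
    Leaf := '';
  end;

begin
  Leaf := '';
  while Idx <= Length(S) do
  begin
    C := S[Idx];
    Inc(Idx);
    if C = '(' then
      WalkList(S, Idx, Depth + 1)  // descend into the nested list
    else if C = ',' then
      FlushLeaf
    else if C = ')' then
    begin
      FlushLeaf;
      Exit;                        // this nesting level is finished
    end
    else
      Leaf := Leaf + C;
  end;
  FlushLeaf;
end;

var
  Clean: string;
  Idx: Integer;
begin
  Clean := StripComments('((a,b)[a comment],c)');
  Idx := 1;
  WalkList(Clean, Idx, 0);
end.
```

Adding more states (for example one for quoted names) extends the `case` without touching the recursive part.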
Have a look at GOLD Parser. It's a meta-parsing system that allows you to define a formal grammar for a language/file format. It creates a parsing rules file which you feed into a tokenizer, together with your input file, and it creates a syntax tree in memory.
There's a Delphi implementation of the tokenizer available on the website. It makes parsing a lot easier, since the lexing and tokenizing are already taken care of for you. All you have to worry about is defining the tokens in a formal grammar and then interpreting them once they've been parsed.