What is the fastest way to parse a line in Delphi?
I have a huge file that I must parse line by line. Speed is of the essence.
Example of a line:
Token-1 Here-is-the-Next-Token Last-Token-on-Line
      ^                      ^
      Current Position       Position after GetToken
GetToken is called, returning "Here-is-the-Next-Token" and setting CurrentPosition to the position of the last character of the token, so that it is ready for the next call to GetToken. Tokens are separated by one or more spaces.
Assume the file is already in a StringList in memory. It fits in memory easily, say 200 MB.
I am worried only about the execution time for the parsing. What code will produce the absolute fastest execution in Delphi (Pascal)?
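For clarity, the contract described above implies something along these lines; this is a minimal illustrative sketch of the behaviour asked for, not code from any of the answers, and the names are mine:

// Illustrative only: the GetToken contract described in the question.
// This version leaves CurrentPosition just past the token rather than
// exactly on its last character; either convention works for the next call.
function GetToken(const Line: string; var CurrentPosition: Integer): string;
var
  StartPos: Integer;
begin
  // Skip the one or more spaces separating tokens.
  while (CurrentPosition <= Length(Line)) and (Line[CurrentPosition] = ' ') do
    Inc(CurrentPosition);
  StartPos := CurrentPosition;
  // Advance past the end of the token.
  while (CurrentPosition <= Length(Line)) and (Line[CurrentPosition] <> ' ') do
    Inc(CurrentPosition);
  Result := Copy(Line, StartPos, CurrentPosition - StartPos);
end;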
9 Answers
Here's a sample lexer that should be pretty efficient, but it assumes that all source data is in a single string. Reworking it to handle buffers is moderately tricky due to very long tokens.
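The lexer itself did not survive the copy here; a minimal sketch of the single-string idea, scanning with a PChar cursor (all names below are mine, not the original answer's), might look like this:

{ Minimal sketch: scan one big string with a PChar cursor. }
procedure LexAll(const Source: string);
var
  P, TokenStart: PChar;
  Token: string;
begin
  P := PChar(Source);
  while P^ <> #0 do
  begin
    // Skip separators (spaces and other control characters).
    while (P^ <> #0) and (P^ <= ' ') do
      Inc(P);
    if P^ = #0 then
      Break;
    TokenStart := P;
    // Scan to the end of the token.
    while P^ > ' ' do
      Inc(P);
    SetString(Token, TokenStart, P - TokenStart);
    // ... hand Token to whatever consumes it ...
  end;
end;

Scanning with a pointer keeps the hot loop down to simple character compares and only allocates a string per token when one is actually extracted.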
Here is a lame ass implementation of a very simple lexer. This might give you an idea.
Note the limitations of this example - no buffering involved, no Unicode (this is an excerpt from a Delphi 7 project). You would probably need those in a serious implementation.
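The excerpt itself is not reproduced in this copy; a bare-bones sketch in the same spirit (Delphi 7-style string indexing, no buffering, no Unicode; the names are placeholders) could be:

type
  TCharSet = set of Char;

const
  Whitespace: TCharSet = [#9, #10, #13, ' '];

// Returns the next token starting at Pos, or '' when the string is exhausted.
function ReadToken(const S: string; var Pos: Integer): string;
var
  Start: Integer;
begin
  // Skip leading whitespace.
  while (Pos <= Length(S)) and (S[Pos] in Whitespace) do
    Inc(Pos);
  Start := Pos;
  // Collect characters until the next whitespace or the end of the string.
  while (Pos <= Length(S)) and not (S[Pos] in Whitespace) do
    Inc(Pos);
  Result := Copy(S, Start, Pos - Start);
end;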
I made a lexical analyser based on a state engine (DFA). It works with a table and is pretty fast. But there are possible faster options.
It also depends on the language. A simple language can possibly have a smart algorithm.
The table is an array of records, each containing 2 chars and 1 integer. For each token the lexer walks through the table, starting at position 0.
It is simple and works like a charm.
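The table and walk loop themselves are not shown in this copy; the sketch below is my guess at the layout described (two chars read as an inclusive character range, plus a next-state index), so treat the names and logic as illustrative:

type
  TTransition = record
    LowChar, HighChar: Char;  // the 2 chars: inclusive range this entry accepts
    NextState: Integer;       // the integer: table index of the next state
  end;

const
  // Tiny one-state table: state 0 accepts any printable non-space character
  // and loops back to itself; anything else ends the token.
  Table: array[0..0] of TTransition =
    ((LowChar: '!'; HighChar: #255; NextState: 0));

function ScanToken(const S: string; var Pos: Integer): string;
var
  State, Start: Integer;
begin
  // Skip separators before the token.
  while (Pos <= Length(S)) and (S[Pos] = ' ') do
    Inc(Pos);
  Start := Pos;
  State := 0;
  // Walk the table from position 0, following transitions while they match.
  while (Pos <= Length(S)) and
        (S[Pos] >= Table[State].LowChar) and
        (S[Pos] <= Table[State].HighChar) do
  begin
    State := Table[State].NextState;
    Inc(Pos);
  end;
  Result := Copy(S, Start, Pos - Start);
end;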
If speed is of the essence, custom code is the answer. Check out the Windows API that will map your file into memory. You can then just use a pointer to the next character to do your tokens, marching through as required.
This is my code for doing a mapping:
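(The answerer's own snippet is not reproduced in this copy; the sketch below shows the usual Win32 mapping calls, with error handling kept to a minimum.)

uses
  Windows;

// Maps FileName read-only and returns a pointer to its first byte, or nil
// on failure. Call UnmapViewOfFile on the result and CloseHandle on both
// handles when finished.
function MapFileReadOnly(const FileName: string;
  out FileHandle, MapHandle: THandle; out Size: Cardinal): PAnsiChar;
begin
  Result := nil;
  MapHandle := 0;
  Size := 0;
  FileHandle := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ,
    nil, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if FileHandle = INVALID_HANDLE_VALUE then
    Exit;
  Size := GetFileSize(FileHandle, nil);
  MapHandle := CreateFileMapping(FileHandle, nil, PAGE_READONLY, 0, 0, nil);
  if MapHandle <> 0 then
    Result := PAnsiChar(MapViewOfFile(MapHandle, FILE_MAP_READ, 0, 0, 0));
end;

From there a PAnsiChar cursor can march through the mapped bytes exactly like a single-string lexer, with the OS paging the file in as it is touched.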
I think the biggest bottleneck is always going to be getting the file into memory. Once you have it in memory (obviously not all of it at once, but I would work with buffers if I were you), the actual parsing should be insignificant.
This begs another question - How big?
Give us a clue, like # of lines or # of MB (GB)?
Then we will know if it fits in memory, needs to be disk based etc.
As a first pass I would use my WordList(S: String; AList: TStringList);
then you can access each token as AList[n]...
or sort them, or whatever.
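The WordList routine itself is not shown in the answer; a plausible shape for it (my guess, not the poster's code) is simply a space-splitter that fills the list:

uses
  Classes;

// Hypothetical reconstruction of WordList: split S on runs of spaces and
// append each token to AList.
procedure WordList(const S: string; AList: TStringList);
var
  Start, Pos: Integer;
begin
  AList.Clear;
  Pos := 1;
  while Pos <= Length(S) do
  begin
    // Skip separators, then collect one token.
    while (Pos <= Length(S)) and (S[Pos] = ' ') do
      Inc(Pos);
    Start := Pos;
    while (Pos <= Length(S)) and (S[Pos] <> ' ') do
      Inc(Pos);
    if Pos > Start then
      AList.Add(Copy(S, Start, Pos - Start));
  end;
end;

Each token is then AList[n], as the answer says, at the cost of one string allocation per token.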
Speed will always be relative to what you are doing once the text is parsed. A lexical parser is by far the fastest method of converting a text stream to tokens, regardless of size. TParser in the Classes unit is a great place to start.
Personally, it's been a while since I needed to write a parser, but another more dated yet tried-and-true method would be to use LEX/YACC to build a grammar, then have it convert the grammar into code you can use to perform your processing. DYacc is a Delphi version... not sure if it still compiles or not, but worth a look if you want to do things old school. The Dragon Book would be a big help here, if you can find a copy.
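As a starting point, driving Classes.TParser over a stream looks roughly like the sketch below; note that TParser was written for DFM-style text, so it is strict about token syntax and may raise exceptions on input it does not expect:

uses
  Classes;

// Rough sketch: print every token TParser finds in the stream.
procedure DumpTokens(Stream: TStream);
var
  Parser: TParser;
begin
  Parser := TParser.Create(Stream);
  try
    while Parser.Token <> toEOF do
    begin
      Writeln(Parser.TokenString);  // text of the current token
      Parser.NextToken;
    end;
  finally
    Parser.Free;
  end;
end;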
Rolling your own is the fastest way for sure. For more on this topic, you could look at SynEdit's source code, which contains lexers (called highlighters in the project's context) for just about any language on the market. I suggest you take one of those lexers as a base and modify it for your own use.
The fastest way to write the code would probably be to create a TStringList and assign each line in your text file to the CommaText property. By default, white space is a delimiter, so you will get one StringList item per token.
You'll probably get better performance by parsing each line yourself, though.
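A minimal sketch of that approach (note that CommaText also treats commas and double quotes specially, so strictly space-separated input may be better served by DelimitedText with Delimiter set to a space):

uses
  Classes;

// Split one line into tokens using TStringList's built-in parsing.
procedure SplitLine(const Line: string; Tokens: TStringList);
begin
  Tokens.CommaText := Line;  // spaces (and commas) act as separators
end;

Allocating one list item per token is convenient but slower than scanning in place, which is the point of the last sentence above.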