在 Delphi 中解析一行的最快方法是什么?

发布于 2024-07-08 04:29:59 字数 420 浏览 12 评论 0原文

我有一个巨大的文件,我必须逐行解析它。 速度至关重要。

一行的示例:

Token-1 这是下一个令牌最后上线的令牌 
        ^^ 
     当前位置 
     GetToken之后的位置 
  

调用 GetToken,返回“Here-is-the-Next-Token”并将 CurrentPosition 设置为令牌最后一个字符的位置,以便为下一次调用 GetToken 做好准备。 令牌由一个或多个空格分隔。

假设该文件已位于内存中的 StringList 中。 它很容易装入内存,例如 200 MB。

我只担心解析的执行时间。 在 Delphi (Pascal) 中什么代码会产生绝对最快的执行速度?

I have a huge file that I must parse line by line. Speed is of the essence.

Example of a line:

Token-1   Here-is-the-Next-Token      Last-Token-on-Line
      ^                        ^
   Current                 Position
   Position              after GetToken

GetToken is called, returning "Here-is-the-Next-Token" and sets the CurrentPosition to the position of the last character of the token so that it is ready for the next call to GetToken. Tokens are separated by one or more spaces.

Assume the file is already in a StringList in memory. It fits in memory easily, say 200 MB.

I am worried only about the execution time for the parsing. What code will produce the absolute fastest execution in Delphi (Pascal)?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

方圜几里 2024-07-15 04:29:59
  • 使用 PChar 递增来提高处理速度
  • 如果不需要某些标记,则仅按需复制标记数据
  • 在实际扫描字符时将 PChar 复制到局部变量
  • 将源数据保留在单个缓冲区中,除非必须逐行处理,即使这样,也要考虑将行处理作为词法分析器识别器中的单独标记进行处理
  • 如果您确实知道编码,请考虑处理直接来自文件的字节数组缓冲区; 如果使用 Delphi 2009,请使用 PAnsiChar 而不是 PChar,除非您知道编码是 UTF16-LE。
  • 如果您知道唯一的空白是 #32(ASCII 空格),或类似的有限字符集,则可能有一些巧妙的位操作技巧,可以让您使用整数扫描一次处理 4 个字节。 不过,我不希望在这里取得重大胜利,而且代码会像泥一样清晰。

这是一个示例词法分析器,应该非常高效,但它假设所有源数据都在单个字符串中。 由于令牌很长,重新处理它来处理缓冲区是相当棘手的。

type
  TLexer = class
  private
    FData: string;
    FTokenStart: PChar;
    FCurrPos: PChar;
    function GetCurrentToken: string;
  public
    constructor Create(const AData: string);
    function GetNextToken: Boolean;
    property CurrentToken: string read GetCurrentToken;
  end;

{ TLexer }

constructor TLexer.Create(const AData: string);
begin
  FData := AData;
  FCurrPos := PChar(FData);
end;

function TLexer.GetCurrentToken: string;
begin
  SetString(Result, FTokenStart, FCurrPos - FTokenStart);
end;

function TLexer.GetNextToken: Boolean;
var
  cp: PChar;
begin
  cp := FCurrPos; // copy to local to permit register allocation

  // skip whitespace; this test could be converted to an unsigned int
  // subtraction and compare for only a single branch
  while (cp^ > #0) and (cp^ <= #32) do
    Inc(cp);

  // using null terminater for end of file
  Result := cp^ <> #0;

  if Result then
  begin
    FTokenStart := cp;
    Inc(cp);
    while cp^ > #32 do
      Inc(cp);
  end;

  FCurrPos := cp;
end;
  • Use PChar incrementing for speed of processing
  • If some tokens are not needed, only copy token data on demand
  • Copy PChar to local variable when actually scanning through characters
  • Keep source data in a single buffer unless you must handle line by line, and even then, consider handling line processing as a separate token in the lexer recognizer
  • Consider processing a byte array buffer that has come straight from the file, if you definitely know the encoding; if using Delphi 2009, use PAnsiChar instead of PChar, unless of course you know the encoding is UTF16-LE.
  • If you know that the only whitespace is going to be #32 (ASCII space), or a similarly limited set of characters, there may be some clever bit manipulation hacks that can let you process 4 bytes at a time using Integer scanning. I wouldn't expect big wins here though, and the code will be as clear as mud.

Here's a sample lexer that should be pretty efficient, but it assumes that all source data is in a single string. Reworking it to handle buffers is moderately tricky due to very long tokens.

type
  TLexer = class
  private
    FData: string;
    FTokenStart: PChar;
    FCurrPos: PChar;
    function GetCurrentToken: string;
  public
    constructor Create(const AData: string);
    function GetNextToken: Boolean;
    property CurrentToken: string read GetCurrentToken;
  end;

{ TLexer }

constructor TLexer.Create(const AData: string);
begin
  FData := AData;
  FCurrPos := PChar(FData);
end;

function TLexer.GetCurrentToken: string;
begin
  SetString(Result, FTokenStart, FCurrPos - FTokenStart);
end;

function TLexer.GetNextToken: Boolean;
var
  cp: PChar;
begin
  cp := FCurrPos; // copy to local to permit register allocation

  // skip whitespace; this test could be converted to an unsigned int
  // subtraction and compare for only a single branch
  while (cp^ > #0) and (cp^ <= #32) do
    Inc(cp);

  // using null terminater for end of file
  Result := cp^ <> #0;

  if Result then
  begin
    FTokenStart := cp;
    Inc(cp);
    while cp^ > #32 do
      Inc(cp);
  end;

  FCurrPos := cp;
end;
魔法少女 2024-07-15 04:29:59

这是一个非常简单的词法分析器的蹩脚实现。 这可能会给你一个想法。

请注意此示例的局限性 - 不涉及缓冲,没有 Unicode(这是 Delphi 7 项目的摘录)。 您可能会在认真的实施中需要这些。

{ Implements a simpe lexer class. } 
unit Simplelexer;

interface

uses Classes, Sysutils, Types, dialogs;

type

  ESimpleLexerFinished = class(Exception) end;

  TProcTableProc = procedure of object;

  // A very simple lexer that can handle numbers, words, symbols - no comment handling  
  TSimpleLexer = class(TObject)
  private
    FLineNo: Integer;
    Run: Integer;
    fOffset: Integer;
    fRunOffset: Integer; // helper for fOffset
    fTokenPos: Integer;
    pSource: PChar;
    fProcTable: array[#0..#255] of TProcTableProc;
    fUseSimpleStrings: Boolean;
    fIgnoreSpaces: Boolean;
    procedure MakeMethodTables;
    procedure IdentProc;
    procedure NewLineProc;
    procedure NullProc;
    procedure NumberProc;
    procedure SpaceProc;
    procedure SymbolProc;
    procedure UnknownProc;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Feed(const S: string);
    procedure Next;
    function GetToken: string;
    function GetLineNo: Integer;
    function GetOffset: Integer;

    property IgnoreSpaces: boolean read fIgnoreSpaces write fIgnoreSpaces;
    property UseSimpleStrings: boolean read fUseSimpleStrings write fUseSimpleStrings;
  end;

implementation

{ TSimpleLexer }

constructor TSimpleLexer.Create;
begin
  makeMethodTables;
  fUseSimpleStrings := false;
  fIgnoreSpaces := false;
end;

destructor TSimpleLexer.Destroy;
begin
  inherited;
end;

procedure TSimpleLexer.Feed(const S: string);
begin
  Run := 0;
  FLineNo := 1;
  FOffset := 1;
  pSource := PChar(S);
end;

procedure TSimpleLexer.Next;
begin
  fTokenPos := Run;
  foffset := Run - frunOffset + 1;
  fProcTable[pSource[Run]];
end;

function TSimpleLexer.GetToken: string;
begin
  SetString(Result, (pSource + fTokenPos), Run - fTokenPos);
end;

function TSimpleLexer.GetLineNo: Integer;
begin
  Result := FLineNo;
end;

function TSimpleLexer.GetOffset: Integer;
begin
  Result := foffset;
end;

procedure TSimpleLexer.MakeMethodTables;
var
  I: Char;
begin
  for I := #0 to #255 do
    case I of
      '@', '&', '}', '{', ':', ',', ']', '[', '*',
        '^', ')', '(', ';', '/', '=', '-', '+', '#', '>', '<', '
,
        '.', '"', #39:
        fProcTable[I] := SymbolProc;
      #13, #10: fProcTable[I] := NewLineProc;
      'A'..'Z', 'a'..'z', '_': fProcTable[I] := IdentProc;
      #0: fProcTable[I] := NullProc;
      '0'..'9': fProcTable[I] := NumberProc;
      #1..#9, #11, #12, #14..#32: fProcTable[I] := SpaceProc;
    else
      fProcTable[I] := UnknownProc;
    end;
end;

procedure TSimpleLexer.UnknownProc;
begin
  inc(run);
end;

procedure TSimpleLexer.SymbolProc;
begin
  if fUseSimpleStrings then
  begin
    if pSource[run] = '"' then
    begin
      Inc(run);
      while pSource[run] <> '"' do
      begin
        Inc(run);
        if pSource[run] = #0 then
        begin
          NullProc;
        end;
      end;
    end;
    Inc(run);
  end
  else
    inc(run);
end;

procedure TSimpleLexer.IdentProc;
begin
  while pSource[Run] in ['_', 'A'..'Z', 'a'..'z', '0'..'9'] do
    Inc(run);
end;

procedure TSimpleLexer.NumberProc;
begin
  while pSource[run] in ['0'..'9'] do
    inc(run);
end;

procedure TSimpleLexer.SpaceProc;
begin
  while pSource[run] in [#1..#9, #11, #12, #14..#32] do
    inc(run);
  if fIgnoreSpaces then Next;
end;

procedure TSimpleLexer.NewLineProc;
begin
  inc(FLineNo);
  inc(run);
  case pSource[run - 1] of
    #13:
      if pSource[run] = #10 then inc(run);
  end;
  foffset := 1;
  fRunOffset := run;
end;

procedure TSimpleLexer.NullProc;
begin
  raise ESimpleLexerFinished.Create('');
end;

end.

Here is a lame ass implementation of a very simple lexer. This might give you an idea.

Note the limitations of this example - no buffering involved, no Unicode (this is an excerpt from a Delphi 7 project). You would probably need those in a serious implementation.

{ Implements a simpe lexer class. } 
unit Simplelexer;

interface

uses Classes, Sysutils, Types, dialogs;

type

  ESimpleLexerFinished = class(Exception) end;

  TProcTableProc = procedure of object;

  // A very simple lexer that can handle numbers, words, symbols - no comment handling  
  TSimpleLexer = class(TObject)
  private
    FLineNo: Integer;
    Run: Integer;
    fOffset: Integer;
    fRunOffset: Integer; // helper for fOffset
    fTokenPos: Integer;
    pSource: PChar;
    fProcTable: array[#0..#255] of TProcTableProc;
    fUseSimpleStrings: Boolean;
    fIgnoreSpaces: Boolean;
    procedure MakeMethodTables;
    procedure IdentProc;
    procedure NewLineProc;
    procedure NullProc;
    procedure NumberProc;
    procedure SpaceProc;
    procedure SymbolProc;
    procedure UnknownProc;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Feed(const S: string);
    procedure Next;
    function GetToken: string;
    function GetLineNo: Integer;
    function GetOffset: Integer;

    property IgnoreSpaces: boolean read fIgnoreSpaces write fIgnoreSpaces;
    property UseSimpleStrings: boolean read fUseSimpleStrings write fUseSimpleStrings;
  end;

implementation

{ TSimpleLexer }

constructor TSimpleLexer.Create;
begin
  makeMethodTables;
  fUseSimpleStrings := false;
  fIgnoreSpaces := false;
end;

destructor TSimpleLexer.Destroy;
begin
  inherited;
end;

procedure TSimpleLexer.Feed(const S: string);
begin
  Run := 0;
  FLineNo := 1;
  FOffset := 1;
  pSource := PChar(S);
end;

procedure TSimpleLexer.Next;
begin
  fTokenPos := Run;
  foffset := Run - frunOffset + 1;
  fProcTable[pSource[Run]];
end;

function TSimpleLexer.GetToken: string;
begin
  SetString(Result, (pSource + fTokenPos), Run - fTokenPos);
end;

function TSimpleLexer.GetLineNo: Integer;
begin
  Result := FLineNo;
end;

function TSimpleLexer.GetOffset: Integer;
begin
  Result := foffset;
end;

procedure TSimpleLexer.MakeMethodTables;
var
  I: Char;
begin
  for I := #0 to #255 do
    case I of
      '@', '&', '}', '{', ':', ',', ']', '[', '*',
        '^', ')', '(', ';', '/', '=', '-', '+', '#', '>', '<', '
,
        '.', '"', #39:
        fProcTable[I] := SymbolProc;
      #13, #10: fProcTable[I] := NewLineProc;
      'A'..'Z', 'a'..'z', '_': fProcTable[I] := IdentProc;
      #0: fProcTable[I] := NullProc;
      '0'..'9': fProcTable[I] := NumberProc;
      #1..#9, #11, #12, #14..#32: fProcTable[I] := SpaceProc;
    else
      fProcTable[I] := UnknownProc;
    end;
end;

procedure TSimpleLexer.UnknownProc;
begin
  inc(run);
end;

procedure TSimpleLexer.SymbolProc;
begin
  if fUseSimpleStrings then
  begin
    if pSource[run] = '"' then
    begin
      Inc(run);
      while pSource[run] <> '"' do
      begin
        Inc(run);
        if pSource[run] = #0 then
        begin
          NullProc;
        end;
      end;
    end;
    Inc(run);
  end
  else
    inc(run);
end;

procedure TSimpleLexer.IdentProc;
begin
  while pSource[Run] in ['_', 'A'..'Z', 'a'..'z', '0'..'9'] do
    Inc(run);
end;

procedure TSimpleLexer.NumberProc;
begin
  while pSource[run] in ['0'..'9'] do
    inc(run);
end;

procedure TSimpleLexer.SpaceProc;
begin
  while pSource[run] in [#1..#9, #11, #12, #14..#32] do
    inc(run);
  if fIgnoreSpaces then Next;
end;

procedure TSimpleLexer.NewLineProc;
begin
  inc(FLineNo);
  inc(run);
  case pSource[run - 1] of
    #13:
      if pSource[run] = #10 then inc(run);
  end;
  foffset := 1;
  fRunOffset := run;
end;

procedure TSimpleLexer.NullProc;
begin
  raise ESimpleLexerFinished.Create('');
end;

end.
剪不断理还乱 2024-07-15 04:29:59

我制作了一个基于状态引擎(DFA)的词法分析器。 它与桌子配合使用并且速度相当快。 但还有可能更快的选择。

这也取决于语言。 简单的语言可能有智能的算法。

该表是一个记录数组,每个记录包含 2 个字符和 1 个整数。 对于每个标记,词法分析器都会从位置 0 开始遍历表格:

state := 0;
result := tkNoToken;
while (result = tkNoToken) do begin
  if table[state].c1 > table[state].c2 then
    result := table[state].value
  else if (table[state].c1 <= c) and (c <= table[state].c2) then begin
    c := GetNextChar();
    state := table[state].value;
  end else
    Inc(state);
end;

它很简单,而且工作起来就像一个魅力。

I made a lexical analyser based on a state engine (DFA). It works with a table and is pretty fast. But there are possible faster options.

It also depends on the language. A simple language can possibly have a smart algorithm.

The table is an array of records each containing 2 chars and 1 integer. For each token the lexer walks through the table, startting at position 0:

state := 0;
result := tkNoToken;
while (result = tkNoToken) do begin
  if table[state].c1 > table[state].c2 then
    result := table[state].value
  else if (table[state].c1 <= c) and (c <= table[state].c2) then begin
    c := GetNextChar();
    state := table[state].value;
  end else
    Inc(state);
end;

It is simple and works like a charm.

寻梦旅人 2024-07-15 04:29:59

如果速度至关重要,那么自定义代码就是答案。 查看将文件映射到内存的 Windows API。 然后,您可以使用指向下一个字符的指针来执行标记,并根据需要前进。

这是我进行映射的代码:

procedure TMyReader.InitialiseMapping(szFilename : string);
var
//  nError : DWORD;
    bGood : boolean;
begin
    bGood := False;
    m_hFile := CreateFile(PChar(szFilename), GENERIC_READ, 0, nil, OPEN_EXISTING, 0, 0);
    if m_hFile <> INVALID_HANDLE_VALUE then
    begin
        m_hMap := CreateFileMapping(m_hFile, nil, PAGE_READONLY, 0, 0, nil);
        if m_hMap <> 0 then
        begin
            m_pMemory := MapViewOfFile(m_hMap, FILE_MAP_READ, 0, 0, 0);
            if m_pMemory <> nil then
            begin
                htlArray := Pointer(Integer(m_pMemory) + m_dwDataPosition);
                bGood := True;
            end
            else
            begin
//              nError := GetLastError;
            end;
        end;
    end;
    if not bGood then
        raise Exception.Create('Unable to map token file into memory');
end;

If speed is of the essence, custom code is the answer. Check out the Windows API that will map your file into memory. You can then just use a pointer to the next character to do your tokens, marching through as required.

This is my code for doing a mapping:

procedure TMyReader.InitialiseMapping(szFilename : string);
var
//  nError : DWORD;
    bGood : boolean;
begin
    bGood := False;
    m_hFile := CreateFile(PChar(szFilename), GENERIC_READ, 0, nil, OPEN_EXISTING, 0, 0);
    if m_hFile <> INVALID_HANDLE_VALUE then
    begin
        m_hMap := CreateFileMapping(m_hFile, nil, PAGE_READONLY, 0, 0, nil);
        if m_hMap <> 0 then
        begin
            m_pMemory := MapViewOfFile(m_hMap, FILE_MAP_READ, 0, 0, 0);
            if m_pMemory <> nil then
            begin
                htlArray := Pointer(Integer(m_pMemory) + m_dwDataPosition);
                bGood := True;
            end
            else
            begin
//              nError := GetLastError;
            end;
        end;
    end;
    if not bGood then
        raise Exception.Create('Unable to map token file into memory');
end;
浪漫人生路 2024-07-15 04:29:59

我认为最大的瓶颈始终是将文件放入内存。 一旦你将它存入内存(显然不是一次全部,但如果我是你,我会使用缓冲区),实际的解析应该是微不足道的。

I think the biggest bottleneck is always going to be getting the file into memory. Once you have it in memory (obviously not all of it at once, but I would work with buffers if I were you), the actual parsing should be insignificant.

苍景流年 2024-07-15 04:29:59

这就引出了另一个问题——有多大?
给我们一些线索,例如行数或 # 或 Mb (Gb)?
然后我们会知道它是否适合内存,是否需要基于磁盘等。

首先,我会使用我的 WordList(S: String; AList: TStringlist);

然后你可以访问每个令牌作为 Alist[n]...
或对它们进行排序或其他什么。

This begs another question - How big?
Give us a clue like # of lines or # or Mb (Gb)?
Then we will know if it fits in memory, needs to be disk based etc.

At first pass I would use my WordList(S: String; AList: TStringlist);

then you can access each token as Alist[n]...
or sort them or whatever.

凉墨 2024-07-15 04:29:59

一旦被解析,速度将始终与您正在做的事情相关。 到目前为止,词法解析器是从文本流转换为标记的最快方法,无论大小如何。 课程单元中的 TParser 是一个很好的起点。

就我个人而言,自从我需要编写解析器以来已经有一段时间了,但另一种更过时但经过尝试且正确的方法是使用 LEX/YACC 构建语法,然后让它将语法转换为可用于执行处理的代码。 DYacc 是 Delphi 版本...不确定它是否仍然可以编译,但是如果你想做一些老派的事情,值得一看。 如果你能找到一本,这里的龙书会有很大帮助。

Speed will always be relative to what you are doing once it is parsed. A lexical parser by far is the fastest method of converting to tokens from a text stream regardless of size. TParser in the classes unit is a great place to start.

Personally its been a while since I needed to write a parser, but another more dated yet tried and true method would be to use LEX/YACC to build a grammar then have it convert the grammar into code you can use to perform your processing. DYacc is a Delphi version...not sure if it still compiles or not, but worth a look if you want to do things old school. The dragon book here would be of big help, if you can find a copy.

傲影 2024-07-15 04:29:59

自己动手无疑是最快的方法。 有关此主题的更多信息,您可以查看 Synedit 的源代码,其中包含词法分析器(在项目上下文中称为荧光笔)适用于市场上的任何语言。 我建议您以这些词法分析器之一为基础,并根据自己的用途进行修改。

Rolling your own is the fastest way for sure. For more on this topic, you could see Synedit's source code which contains lexers (called highlighters in the project's context) for about any language on the market. I suggest you take one of those lexers as a base and modify for your own usage.

夜未央樱花落 2024-07-15 04:29:59

编写代码的最快方法可能是创建一个 TStringList 并将文本文件中的每一行分配给 CommaText 属性。 默认情况下,空格是分隔符,因此每个标记将获得一个 StringList 项目。

MyStringList.CommaText := s;
for i := 0 to MyStringList.Count - 1 do
begin
  // process each token here
end;

不过,您自己解析每一行可能会获得更好的性能。

The fastest way to write the code would probably be to create a TStringList and assign each line in your text file to the CommaText property. By default, white space is a delimiter, so you will get one StringList item per token.

MyStringList.CommaText := s;
for i := 0 to MyStringList.Count - 1 do
begin
  // process each token here
end;

You'll probably get better performance by parsing each line yourself, though.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文