Parsing an HTTP directory listing

Posted on 2025-01-08 14:55:32

Good day! I'm using Delphi XE and Indy TIdHTTP. Using the Get method I retrieve a remote directory listing, and I need to parse it: get the list of files with their sizes and timestamps, and distinguish files from subdirectories. Is there a good routine to do that? Thank you in advance! Vojtech

Here is the sample:

<head>
  <title>127.0.0.1 - /</title>
</head>
<body>
  <H1>127.0.0.1 - /</H1><hr>
<pre>      
  Mittwoch, 30. März 2011    12:01        &lt;dir&gt; <A HREF="/SubDir/">SubDir</A><br />
  Mittwoch, 9. Februar 2005    17:14          113 <A HREF="/file.txt">file.txt</A><br />
</pre>
<hr>
</body>

Comments (2)

疯了 2025-01-15 14:55:32

Given the code sample, I guess the fastest way to parse it would be like this:

  • Identify the <pre>...</pre> block containing all the listing lines. Should be easy.
  • Put everything between the <pre> and </pre> into a TStringList. Each line is a file or folder, and the format is very simple.
  • Extract the link from each line and, if you need them, the date, time and size. This is best done with a regex (Delphi XE has built-in regex support).
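
The three steps above could be sketched roughly as follows. This is only an illustration, not a tested routine: the names ExtractPreBlock, DumpEntries and LinePattern are made up for this example, the pattern is fitted to the two sample lines from the question (date, time, then a size or a dir marker, then the link), and the unit names are the unscoped Delphi XE ones (newer versions call them System.SysUtils and so on).

```delphi
program ListingRegexSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, RegularExpressions;

const
  // One listing line: date, time, dir marker or size, then the link.
  // The dir marker is accepted both HTML-escaped and raw.
  LinePattern =
    '^\s*(.+?)\s+(\d{1,2}:\d{2})\s+(&lt;dir&gt;|<dir>|\d+)\s+' +
    '<a\s+href="([^"]*)"[^>]*>([^<]*)</a>';

// Step 1: return the text between <pre> and </pre>, or '' if there is none.
function ExtractPreBlock(const Html: string): string;
var
  M: TMatch;
begin
  Result := '';
  M := TRegEx.Match(Html, '<pre[^>]*>(.*?)</pre>', [roIgnoreCase, roSingleLine]);
  if M.Success then
    Result := M.Groups[1].Value;
end;

procedure DumpEntries(const Html: string);
var
  Lines: TStringList;
  M: TMatch;
  I: Integer;
  Size: Int64;
begin
  Lines := TStringList.Create;
  try
    // Step 2: one listing entry per line.
    Lines.Text := ExtractPreBlock(Html);
    for I := 0 to Lines.Count - 1 do
    begin
      // Step 3: pull the fields out of each line.
      M := TRegEx.Match(Lines[I], LinePattern, [roIgnoreCase]);
      if not M.Success then
        Continue; // blank line, or a format this sketch does not cover
      if TryStrToInt64(M.Groups[3].Value, Size) then
        Writeln(Format('FILE %s (%d bytes), %s %s, href=%s',
          [M.Groups[5].Value, Size, M.Groups[1].Value,
           M.Groups[2].Value, M.Groups[4].Value]))
      else
        Writeln(Format('DIR  %s, %s %s, href=%s',
          [M.Groups[5].Value, M.Groups[1].Value,
           M.Groups[2].Value, M.Groups[4].Value]));
    end;
  finally
    Lines.Free;
  end;
end;

begin
  // The two sample lines from the question; #$E4 is the a-umlaut in "Maerz".
  DumpEntries(
    '<pre>' + sLineBreak +
    '  Mittwoch, 30. M'#$E4'rz 2011    12:01        &lt;dir&gt; ' +
    '<A HREF="/SubDir/">SubDir</A><br />' + sLineBreak +
    '  Mittwoch, 9. Februar 2005    17:14          113 ' +
    '<A HREF="/file.txt">file.txt</A><br />' + sLineBreak +
    '</pre>');
end.
```

The date is kept as plain text here; turning the German day and month names into a TDateTime would need locale-aware handling that this sketch leaves out. A server that formats its listing differently would need a different pattern.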

逆夏时光 2025-01-15 14:55:32

This should give you a good start and idea using DOM:

uses
  MSHTML,
  ActiveX,
  ComObj;

procedure DocumentFromString(Document: IHTMLDocument2; const S: WideString);
var
  v: OleVariant;
begin
  v := VarArrayCreate([0, 0], varVariant);
  v[0] := S;
  Document.Write(PSafeArray(TVarData(v).VArray));
  Document.Close;
end;

function StripMultipleChar(const S: string; const C: Char): string;
begin
  Result := S;
  while Pos(C + C, Result) <> 0 do
    Result := StringReplace(Result, C + C, C, [rfReplaceAll]);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  Document: IHTMLDocument2;
  Elements: IHTMLElementCollection;
  Element: IHTMLElement;
  I: Integer;
  Line: string;
begin
  Document := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
  DocumentFromString(Document, '<head>...'); // your HTML here

  Elements := Document.all.tags('A') as IHTMLElementCollection;
  for I := 0 to Elements.length - 1 do
  begin
    Element := Elements.item(I, '') as IHTMLElement;
    Memo1.Lines.Add('A HREF=' + Element.getAttribute('HREF', 2));
    Memo1.Lines.Add('A innerText=' + Element.innerText);

    // Get the text that sits immediately before the element
    Line := (Element as IHTMLElement2).getAdjacentText('beforeBegin');

    // Line => "Mittwoch, 30. März 2011 12:01 <dir>" OR:
    // Line => "Mittwoch, 9. Februar 2005 17:14 113"...
    // I don't know what the actual delimiter is:
    // it could be [space] or [tab], so we need to normalize the Line.
    // If it is tabs, parsing is easier because the timestamps also contain spaces.

    Line := Trim(Line);
    Line := StripMultipleChar(Line, #32); // strip multiple Spaces sequences
    Line := StripMultipleChar(Line, #9);  // strip multiple Tabs sequences

    // TODO: ParseLine (from right to left)

    Memo1.Lines.Add(Line);
    Memo1.Lines.Add('-------------');
  end;
end;

Output:

A HREF=/SubDir/
A innerText=SubDir
Mittwoch, 30. März 2011 12:01 <dir>
-------------
A HREF=/file.txt
A innerText=file.txt
Mittwoch, 9. Februar 2005 17:14 113
-------------

EDIT:
I have simplified the StripMultipleChar implementation. I believe the former version was better optimized for speed, but since the lines are very short, there will be little difference in performance.
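
The ParseLine step left as a TODO above could look something like the following minimal sketch. It is an assumption about how one might do it, not part of the original answer: TListingEntry, PopLastToken and ParseLine are hypothetical names, tabs are simply turned into spaces, and the date is kept as text rather than converted to a TDateTime. Reading from right to left works because the last token is always the size or the dir marker and the one before it is the time, so whatever remains is the date, spaces included.

```delphi
program ParseLineSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical record for one parsed listing line.
  TListingEntry = record
    DateText: string;
    TimeText: string;
    IsDir: Boolean;
    Size: Int64;
  end;

// Removes the last space-separated token from S and returns it.
function PopLastToken(var S: string): string;
var
  P: Integer;
begin
  S := TrimRight(S);
  P := LastDelimiter(' ', S);         // 0 when S contains no space
  Result := Copy(S, P + 1, MaxInt);
  S := TrimRight(Copy(S, 1, P - 1));  // becomes '' when P = 0
end;

// Parses a normalized line such as "Mittwoch, 9. Februar 2005 17:14 113".
// Returns False when the line does not have the expected shape.
function ParseLine(const Line: string; out Entry: TListingEntry): Boolean;
var
  Rest, Token: string;
begin
  // Treat tabs as spaces so a single delimiter is enough.
  Rest := Trim(StringReplace(Line, #9, ' ', [rfReplaceAll]));

  Token := PopLastToken(Rest);           // size, or the dir marker
  Entry.Size := 0;
  Entry.IsDir := SameText(Token, '<dir>');
  Result := Entry.IsDir or TryStrToInt64(Token, Entry.Size);

  Entry.TimeText := PopLastToken(Rest);  // e.g. "17:14"
  Entry.DateText := Rest;                // everything that is left
  Result := Result and (Entry.TimeText <> '') and (Entry.DateText <> '');
end;

var
  E: TListingEntry;
begin
  if ParseLine('Mittwoch, 9. Februar 2005 17:14 113', E) then
    Writeln(Format('date=%s time=%s dir=%s size=%d',
      [E.DateText, E.TimeText, BoolToStr(E.IsDir, True), E.Size]))
  else
    Writeln('Line not recognized');
end.
```

In the Button1Click handler above, ParseLine would be called on Line after the Trim and StripMultipleChar calls, and the file or folder name would still come from Element.innerText.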
