HttpGetText(): auto-detect the charset and convert the source to UTF-8

Posted on 2024-10-28 03:16:02

I'm using HttpGetText with Synapse for Delphi 7 Professional to fetch the source of a web page, but feel free to recommend any component and code.

The goal is to save some time by "unifying" non-ASCII characters into a single charset, so that I can process every page with the same Delphi code.

So I'm looking for something like "select all and convert to UTF-8 without BOM" in Notepad++, if you know what I mean. ANSI instead of UTF-8 would also be fine.

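The web pages come in three encodings: UTF-8, "ISO-8859-1 = Win 1252 = ANSI", and plain HTML4 with no charset specification, i.e. HTML-encoded characters of the Å type (entity references) in the content.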

If I need to write a PHP page to do the conversion, that's fine too. Whatever takes the least code and time.


留一抹残留的笑 2024-11-04 03:16:02

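When you retrieve a web page, its Content-Type header (or sometimes a tag inside the HTML itself) tells you which charset the data uses. Use that charset to decode the data to Unicode, then encode the Unicode to whatever you need for processing.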
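The steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than Delphi, only to show the logic; the function name `to_utf8`, the `expand_entities` flag, and the choice of Windows-1252 as the fallback when no charset is declared are assumptions made for this example, not something the question or Synapse prescribes.

```python
import html
import re

FALLBACK = "cp1252"  # assumption: treat undeclared pages as Windows-1252 ("ANSI")

def to_utf8(raw: bytes, content_type: str = "", expand_entities: bool = False) -> bytes:
    """Decode raw page bytes with the declared charset and re-encode them as UTF-8."""
    # 1. Prefer the HTTP header, e.g. "text/html; charset=ISO-8859-1".
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type, re.I)
    if not match:
        # 2. Otherwise look for a <meta ... charset=...> declaration near the top.
        head = raw[:4096].decode("ascii", errors="ignore")
        match = re.search(r"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", head, re.I)
    charset = match.group(1) if match else FALLBACK
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # The page declared a charset name Python does not know.
        text = raw.decode(FALLBACK, errors="replace")
    if expand_entities:
        # Turns &Aring; into Å, but also &lt; into <, so only use this
        # on text you will not parse as markup again.
        text = html.unescape(text)
    return text.encode("utf-8")
```

For example, `to_utf8(body, "text/html; charset=ISO-8859-1")` would return the same page as UTF-8 bytes without a BOM, and pages that only use entity references can be passed with `expand_entities=True`.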


成熟的代价 2024-11-04 03:16:02

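Instead, I did the reverse conversion (UTF-8 to ISO-8859-1) right after retrieving the HTML, using GpTextStream. Making the documents conform to ISO-8859-1 meant they could be processed with plain Delphi string handling, which saved quite a lot of code changes. On output, all the data was converted to UTF-8 :)

Here's some code. Perhaps not the prettiest solution, but it certainly got the job done in less time. Note that this is for the reverse conversion.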

// Requires: SysUtils, Classes, Windows (CP_UTF8), GpTextStream
procedure UTF8FileTo88591(const fileName: string);
const
  bufsize = 1024 * 1024;
var
  fs1, fs2: TFileStream;
  ts1, ts2: TGpTextStream;
  buf: PChar;
  siz: Integer;

  procedure LG2(const ss: string);
  begin
    // Logging is disabled for now.
  end;

begin
  // Nil everything first so the finally block can free safely
  // even if one of the constructors raises.
  fs1 := nil; fs2 := nil; ts1 := nil; ts2 := nil; buf := nil;
  try
    fs1 := TFileStream.Create(fileName, fmOpenRead);
    fs2 := TFileStream.Create(fileName + '_ISO88591.txt', fmCreate);
    LG2('Files opened OK.');
    GetMem(buf, bufsize);
    // ISO-8859-1 is compatible enough for my purposes with the default
    // 'Windows/Notepad' CP 1252 ANSI and Swedish ANSI codepage, Latin1 etc.
    // It also works for ASCII sources with HTML-encoded accented chars, naturally.
    ts1 := TGpTextStream.Create(fs1, tsaccRead, [], CP_UTF8);
    ts2 := TGpTextStream.Create(fs2, tsaccWrite, [], ISO_8859_1);
    // Copy in chunks so files larger than bufsize are not truncated.
    repeat
      siz := ts1.Read(buf^, bufsize);
      LG2(IntToStr(siz) + ' bytes read.');
      if siz > 0 then
        ts2.Write(buf^, siz);
    until siz <= 0;
    LG2('Bytes read and written OK.');
  finally
    // Free the text streams before the file streams they wrap.
    FreeAndNil(ts1);
    FreeAndNil(ts2);
    FreeAndNil(fs1);
    FreeAndNil(fs2);
    if buf <> nil then
      FreeMem(buf);
    LG2('Everything freed OK.');
  end;
end; // UTF8FileTo88591
