HttpGetText(): auto-detect the charset and convert the source to UTF-8

Posted on 2024-10-28 03:16:02

I'm using HttpGetText with Synapse for Delphi 7 Professional to get the source of a web page - but feel free to recommend any component and code.

The goal is to save some time by 'unifying' non-ASCII characters to a single charset, so I can process it with the same Delphi code.

So I'm looking for something similar to "Select All" followed by "Convert to UTF-8 without BOM" in Notepad++, if you know what I mean. ANSI instead of UTF-8 would also be okay.

The web pages come in three encodings: UTF-8, "ISO-8859-1 = Win 1252 = ANSI" (which I treat as equivalent), and plain HTML 4 without a charset specification, i.e. with HTML-encoded characters such as &Aring; in the content.

If I need to code a PHP page that does the conversion, that's fine too. Whatever takes the least code and time.

Comments (2)

留一抹残留的笑 2024-11-04 03:16:02

When you retrieve a webpage, its Content-Type header (or sometimes a <meta> tag inside the HTML itself) tells you which charset is being used for the data. You would decode the data to Unicode using that charset, then you can encode the Unicode to whatever you need for your processing.
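To make the decode-then-encode idea concrete, here is a minimal Delphi 7 sketch. It is not Synapse code: HttpGetText returns only the body, so the Content-Type value would have to come from elsewhere (for example the response headers of Synapse's THTTPSend class, or a <meta> tag in the HTML), which is not shown. The names ExtractCharset, CharsetToCodePage and ToUtf8 are made up for this illustration, and the fallback to Windows-1252 for unknown or missing charsets is an assumption based on the three encodings named in the question. Only the Win32 call MultiByteToWideChar and the RTL function UTF8Encode do the actual conversion.

```pascal
program CharsetToUtf8Demo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

const
  CodePageUtf8    = 65001; // Windows code page number for UTF-8
  CodePageWin1252 = 1252;  // Windows-1252, used here for ISO-8859-1 as well

// Hypothetical helper: pulls the charset name out of a Content-Type value
// such as 'text/html; charset=ISO-8859-1'. Returns '' when none is given.
function ExtractCharset(const ContentType: string): string;
var
  P: Integer;
begin
  Result := '';
  P := Pos('charset=', LowerCase(ContentType));
  if P = 0 then Exit;
  Result := Copy(ContentType, P + Length('charset='), MaxInt);
  // cut off any parameter that follows the charset
  P := Pos(';', Result);
  if P > 0 then Result := Copy(Result, 1, P - 1);
  Result := UpperCase(Trim(StringReplace(Result, '"', '', [rfReplaceAll])));
end;

// Hypothetical helper: maps a charset name to a Windows code page.
// Assumption: anything that is not UTF-8 is treated as Windows-1252.
function CharsetToCodePage(const Charset: string): Cardinal;
begin
  if (Charset = 'UTF-8') or (Charset = 'UTF8') then
    Result := CodePageUtf8
  else
    Result := CodePageWin1252;
end;

// Decodes the raw bytes to UTF-16 using the given code page,
// then encodes the result as UTF-8.
function ToUtf8(const Raw: AnsiString; CodePage: Cardinal): UTF8String;
var
  W: WideString;
  Len: Integer;
begin
  Result := '';
  if Raw = '' then Exit;
  if CodePage = CodePageUtf8 then
  begin
    Result := Raw; // already UTF-8, nothing to convert
    Exit;
  end;
  // first call asks for the required length, second call converts
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(Raw), Length(Raw), nil, 0);
  if Len = 0 then Exit;
  SetLength(W, Len);
  MultiByteToWideChar(CodePage, 0, PAnsiChar(Raw), Length(Raw), PWideChar(W), Len);
  Result := UTF8Encode(W);
end;

var
  Body: AnsiString;
  Charset: string;
begin
  Body := 'Sm'#$F6'rg'#$E5's'; // seven bytes in Windows-1252, two of them non-ASCII
  Charset := ExtractCharset('text/html; charset=ISO-8859-1');
  Writeln(Charset);
  // the two non-ASCII characters take two bytes each in UTF-8
  Writeln(Length(ToUtf8(Body, CharsetToCodePage(Charset)))); // prints 9
end.
```

HTML-encoded characters such as &Aring; are plain ASCII and pass through this conversion unchanged, so pages of that third kind need no special handling at this step.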

成熟的代价 2024-11-04 03:16:02

Instead, I did the reverse conversion with GpTextStream directly after retrieving the HTML. Making the documents conform to ISO-8859-1 made them processable with plain Delphi string code, which saved quite a few code changes. On output, all the data was converted to UTF-8 :)

Here's some code. Perhaps not the prettiest solution but it certainly got the job done in less time. Note that this is for the reverse conversion.

procedure UTF8FileTo88591(fileName: string);
const
  bufsize = 1024*1024;
var
  fs1, fs2: TFileStream;
  ts1, ts2: TGpTextStream;
  buf: PChar;
  siz: integer;

  procedure LG2(ss: string);
  begin
    //dont log for now.
  end;

begin
  // nil everything up front so the finally block is safe if a Create or GetMem raises
  fs1 := nil; fs2 := nil; ts1 := nil; ts2 := nil; buf := nil;
  try
    fs1 := TFileStream.Create(fileName, fmOpenRead);
    fs2 := TFileStream.Create(fileName+'_ISO88591.txt', fmCreate);
    //compatible enough for my purposes with default 'Windows/Notepad' CP 1252 ANSI and Swe ANSI codepage, Latin1 etc.
    //also works for ASCII sources with htmlencoded accent chars, naturally
    LG2('Files opened OK.');
    GetMem(buf, bufsize);
    ts1 := TGpTextStream.Create(fs1, tsaccRead, [], CP_UTF8);
    ts2 := TGpTextStream.Create(fs2, tsaccWrite, [], ISO_8859_1);
    // loop so that input larger than one buffer is not cut off
    repeat
      siz := ts1.Read(buf^, bufsize);
      LG2(inttostr(siz)+' bytes read.');
      if siz > 0 then ts2.Write(buf^, siz);
    until siz <= 0;
    LG2('Bytes read and written OK.');
  finally
    // free the text streams first, then the file streams they wrap
    FreeAndNil(ts1); FreeAndNil(ts2);
    FreeAndNil(fs1); FreeAndNil(fs2);
    if buf <> nil then FreeMem(buf);
    LG2('Everything freed.');
  end;
end; // UTF8FileTo88591