HttpGetText(): auto-detect the charset and convert the source to UTF-8
I'm using HttpGetText with Synapse for Delphi 7 Professional to get the source of a web page - but feel free to recommend any component and code.
The goal is to save some time by 'unifying' non-ASCII characters to a single charset, so I can process it with the same Delphi code.
So I'm looking for something similar to "Select All and Convert to UTF-8 without BOM" in Notepad++, if you know what I mean. ANSI instead of UTF-8 would also be okay.
Webpages are encoded in 3 charsets: UTF-8, "ISO-8859-1 = Win 1252 = ANSI", and plain HTML4 without a charset spec, i.e. with HTML-encoded characters in the content (entities such as &Aring; for Å).
If I need to code a PHP page that does the conversion, that's fine too. Whatever is the least code / time.
When you retrieve a webpage, its Content-Type header (or sometimes a <meta> tag inside the HTML itself) tells you which charset is being used for the data. You would decode the data to Unicode using that charset, then you can encode the Unicode to whatever you need for your processing.
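To make the decode-then-encode idea concrete, here is a minimal sketch in Python rather than Delphi/Synapse. The function names detect_charset and to_utf8, the windows-1252 fallback, and the 4096-byte scan window are assumptions made for illustration and are not part of any library mentioned in this thread. It takes the charset from the Content-Type header, falls back to a <meta> tag, then to the default, decodes to Unicode, resolves HTML entities such as &Aring;, and encodes to UTF-8 without a BOM.

```python
import html
import re

# Hypothetical helpers for illustration; not taken from Synapse or the original post.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def detect_charset(content_type, body, default="windows-1252"):
    """Pick a charset: Content-Type header first, then a <meta> tag, then a default."""
    match = _CHARSET_RE.search(content_type.encode("ascii", "ignore"))
    if match is None:
        # Only scan the start of the document, where <meta> tags normally live.
        match = _CHARSET_RE.search(body[:4096])
    return match.group(1).decode("ascii") if match else default


def to_utf8(content_type, body):
    """Decode raw bytes to Unicode, resolve HTML entities, re-encode as UTF-8 (no BOM)."""
    charset = detect_charset(content_type, body)
    try:
        text = body.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to the assumed default.
        text = body.decode("windows-1252", errors="replace")
    # Drop a leading BOM if the source had one, then turn &Aring; and friends into characters.
    text = html.unescape(text.lstrip("\ufeff"))
    return text.encode("utf-8")
```

For example, to_utf8("text/html; charset=ISO-8859-1", raw_bytes) returns UTF-8 bytes for a Latin-1 page. One caveat with this sketch: html.unescape also turns &lt; and &amp; into literal characters, so if the markup has to stay parseable it is safer to apply it to the extracted text only.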
I instead did the reverse conversion directly after retrieving the HTML using GpTextStream. Making the documents conform to ISO-8859-1 made them processable using straight up Delphi, which saved quite a bit of code changes. On output all the data was converted to UTF-8 :)
Here's some code. Perhaps not the prettiest solution but it certainly got the job done in less time. Note that this is for the reverse conversion.
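The GpTextStream code itself is not reproduced in this copy of the thread. As a rough, separate illustration of the same reverse-conversion idea (not the answerer's Delphi code), here is a minimal Python sketch. The names to_latin1 and latin1_to_utf8 are hypothetical, and the source charset is assumed to be known already, for instance from the Content-Type header. Everything is normalized to single-byte ISO-8859-1 for processing, which suits Delphi 7's one-byte-per-character AnsiString, and converted to UTF-8 only on output.

```python
import html


def to_latin1(body, source_charset):
    """Convert raw page bytes to ISO-8859-1 so byte-per-character code can process them."""
    text = html.unescape(body.decode(source_charset, errors="replace"))
    # Characters that do not exist in ISO-8859-1 are kept as numeric references (&#NNNN;).
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def latin1_to_utf8(data):
    """Final output step: re-encode the processed ISO-8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")
```

The trade-off of this direction is that anything outside ISO-8859-1 survives only as a numeric character reference, which is fine for pages that are mostly Western European text but lossy in spirit for anything else.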