How to convert ISO 8859-1 characters to UTF-8
I use cURL to get content from another site, but I don't know why it is automatically converted from UTF-8 to ISO 8859-1, as follows:
site: abc.com:
Cửa Hàng Chip Chip: Rộn ràng đón Giáng sinh với những vật phẩm trang trí Noel đầy màu sắc của CHIPCHIP GIFT SHOP
But when I use cURL to get the content from that site, I get the following:
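C&#7917;a H&agrave;ng Chip Chip: R&#7897;n r&agrave;ng &#273;&oacute;n Gi&aacute;ng sinh v&#7899;i nh&#7919;ng v&#7853;t ph&#7849;m trang tr&iacute; Noel &#273;&#7847;y m&agrave;u s&#7855;c c&#7911;a CHIPCHIP GIFT SHOP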
So how can I convert it to UTF-8?
Comments (6)
I'd recommend using iconv.
Running iconv --list gives you a list of all known encodings, and you can then use iconv -f FROM_ENCODING -t TO_ENCODING to do your conversion. It can also read from stdin, so curl output can be piped into it.
But regarding the comment you got on your question: it seems the file author didn't care about using the correct encoding and decided to stick with (old-style?) entities such as &auml;.
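For readers who would rather do the iconv step in code, here is a minimal Python sketch of the same re-encoding. The helper name latin1_to_utf8 and the sample bytes are made up for illustration, and it assumes the input really is ISO 8859-1; it does nothing about HTML entities.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-1 bytes as UTF-8 (roughly iconv -f ISO-8859-1 -t UTF-8)."""
    # Every byte value maps to a character in ISO 8859-1, so this decode cannot fail.
    text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


# Hypothetical sample: "café" encoded as ISO 8859-1 (é is the single byte 0xE9).
sample = b"caf\xe9"
converted = latin1_to_utf8(sample)
print(converted)  # b'caf\xc3\xa9'
```

If the bytes are already UTF-8, running them through this function would double-encode them, so it is worth checking the response's declared charset first.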
Put your string in a variable and use the following function.
Judging from the line you pasted, the problem appears to be with HTML entities, not with character encoding. The encoded chars look fine to me.
You need to translate those HTML entities to encoded chars. Which tool to use will depend on your environment or programming language. I don't think it can be done with cURL alone.
PHP has html_entity_decode() (htmlspecialchars_decode() only handles a handful of special characters such as &amp; and &lt;). Python has unescape(), in the html module in Python 3 and formerly a method in the HTMLParser module.
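As a rough illustration of the Python route, the sketch below decodes a short entity-encoded fragment with html.unescape() from the standard library. The fragment is a made-up sample in the style of the question, not the actual response from the site.

```python
import html

# Hypothetical fragment as it might arrive from curl: numeric and named entities.
fetched = "C&#7917;a H&agrave;ng Chip Chip"

# html.unescape() replaces both numeric (&#7917;) and named (&agrave;) references.
decoded = html.unescape(fetched)
print(decoded)  # Cửa Hàng Chip Chip
```

The result is an ordinary Python string; encode it with .encode("utf-8") when writing it to a file or a socket.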
curl does not convert anything; it downloads things "as is".
What you see are character entities, which are valid HTML, and it is the browser that does the conversion to a readable form.
You can check this by opening the file saved by curl in a browser. It will look like the live page.
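One way to convince yourself that no charset conversion is involved: entity-encoded text consists only of ASCII bytes, so it reads the same under ISO 8859-1 and UTF-8. A small sketch, using a made-up fragment rather than the real page:

```python
# Hypothetical bytes as saved by curl: every non-ASCII letter is written as an entity.
page = b"C&#7917;a H&agrave;ng"

# Only ASCII bytes are present, so there is nothing for a charset to change.
print(page.isascii())  # True

# Decoding as ISO 8859-1 or as UTF-8 yields exactly the same text.
print(page.decode("iso-8859-1") == page.decode("utf-8"))  # True
```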
You can try this:
See more here: html_entity_decode
Your files aren't being converted to another encoding. They're using HTML character entities. You need to convert those entities, such as &eacute;, to UTF-8 characters, such as é. This takes one extra line of code after you convert to UTF-8, if you even need to do that.
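Put together, the two steps might look like the following Python sketch. The function name to_utf8_bytes, its source_encoding parameter, and the sample input are assumptions for illustration; the real source encoding should come from the response's Content-Type header or the page's meta tag.

```python
import html


def to_utf8_bytes(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Decode raw bytes, resolve HTML entities, and return UTF-8 bytes."""
    # Step 1: interpret the bytes using the page's declared encoding.
    text = raw.decode(source_encoding)
    # Step 2: the "one extra line" that turns entities into real characters.
    text = html.unescape(text)
    return text.encode("utf-8")


# Hypothetical input mixing a raw ISO 8859-1 byte (0xE9) with an entity.
raw = b"r\xe9sum&eacute;"
result = to_utf8_bytes(raw, "iso-8859-1")
print(result.decode("utf-8"))  # résumé
```

When the page is already UTF-8 or pure ASCII, the default source_encoding suffices and only the unescape step changes anything.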