How to detect whether UTF-8 decoding or encoding must be applied to a string?
I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.
If the same function is applied twice by mistake, or the wrong one is used, the output becomes even uglier, and this is what I want to fix.
How can I detect which of the two, if either, has to be applied to a given string?
The content is actually returned as UTF-8, but some parts inside it are not.
Comments (5)
I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back. The most universal way I found to work well in every case was:
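The snippet that followed this sentence is not preserved here. As a stand-in, and not necessarily what the answer's author used, one common approach is to test whether the bytes are already valid UTF-8 and convert only when they are not. The sketch below assumes the mbstring extension is available and that anything failing the check is ISO-8859-1 (the only source encoding utf8_encode handles); the function name to_utf8 is hypothetical. It uses mb_convert_encoding rather than utf8_encode, which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: return $s as UTF-8, converting only when needed.
// ASSUMPTION: input that is not valid UTF-8 is ISO-8859-1.
function to_utf8(string $s): string
{
    // With the /u modifier, preg_match() returns false for a malformed
    // UTF-8 subject, so a result of 1 means the bytes are valid UTF-8.
    if (preg_match('//u', $s) === 1) {
        return $s; // already valid UTF-8, leave untouched
    }
    // Not valid UTF-8: reinterpret the bytes as ISO-8859-1.
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// "caf\xE9" is "café" in ISO-8859-1; "caf\xC3\xA9" is the same word in UTF-8.
echo bin2hex(to_utf8("caf\xE9")), PHP_EOL;
echo bin2hex(to_utf8("caf\xC3\xA9")), PHP_EOL;
```

Because valid UTF-8 passes through unchanged, calling the helper twice does not double-encode the string. The check can still be fooled: a short ISO-8859-1 string may happen to form a valid UTF-8 byte sequence, in which case it is left as-is.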
You can use mb_detect_encoding (Detect character encoding). The character set might also be available in the HTTP response headers or in the response data itself.
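The original example and its codepad output are not preserved here. A minimal sketch of how mb_detect_encoding might be called is shown below, assuming the mbstring extension and assuming the feed can only be UTF-8 or ISO-8859-1; the helper name guess_encoding and the candidate list are illustrative choices, not part of the answer.

```php
<?php
/**
 * Hypothetical helper: guess the encoding from an explicit candidate list.
 * ASSUMPTION: the data is either UTF-8 or ISO-8859-1.
 *
 * @return string|false the detected encoding name, or false if none fits
 */
function guess_encoding(string $s)
{
    // Strict mode (third argument) rejects any candidate in which the
    // byte sequence is invalid, instead of picking the closest match.
    return mb_detect_encoding($s, ['UTF-8', 'ISO-8859-1'], true);
}

// "\xE9" alone is not a valid UTF-8 sequence, so UTF-8 is ruled out here.
$enc = guess_encoding("caf\xE9");
echo $enc === false ? 'unknown' : $enc, PHP_EOL;
```

Passing an explicit, short candidate list tends to be more predictable than relying on the default detection order. When a string is valid in more than one candidate, the result depends on the PHP version's heuristics, so it is best treated as a guess.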
Encoding auto-detection is not bullet-proof, but you can try mb_detect_encoding(). See also mb_check_encoding().
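For the situation described in the question, where the content is mostly UTF-8 with some parts that are not, mb_check_encoding() could be applied piece by piece. The following is a rough sketch under two assumptions that are not in the original: the text can be split on newlines, and any line that fails the UTF-8 check is ISO-8859-1. The function name repair_lines is hypothetical.

```php
<?php
// Hypothetical helper: repair a mostly-UTF-8 text one line at a time.
// ASSUMPTION: a line that is not valid UTF-8 is ISO-8859-1.
function repair_lines(string $text): string
{
    // Splitting on "\n" is safe: byte 0x0A never occurs inside a
    // multi-byte UTF-8 sequence.
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $lines[$i] = mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
        }
    }
    return implode("\n", $lines);
}

// First line is UTF-8 ("naïve"), second line is ISO-8859-1 ("café").
$mixed = "na\xC3\xAFve\ncaf\xE9";
echo bin2hex(repair_lines($mixed)), PHP_EOL;
```

A line that mixes both encodings internally would still come out wrong, since the whole line is converted at once; a finer-grained split would be needed for that case.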
The feed (I guess you mean some kind of XML-based feed) should have an attribute in the header telling you what the encoding is. If not, you are out of luck, as you don't have a reliable means of identifying the encoding.
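If the feed is XML, the attribute in question is the encoding pseudo-attribute of the XML declaration on the first line. A minimal sketch of reading it is given below; the helper name declared_xml_encoding is hypothetical, and the sketch assumes the declaration sits at the very start of the document with no byte order mark in front of it.

```php
<?php
// Hypothetical helper: read the encoding named in the XML declaration.
// ASSUMPTION: the declaration, if present, starts the document (no BOM).
function declared_xml_encoding(string $xml): ?string
{
    // Looks for encoding="..." (or single-quoted) inside the leading
    // "<?xml ..." declaration and captures the encoding name.
    $pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/';
    if (preg_match($pattern, $xml, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null; // no declared encoding found
}

$feed = '<?xml version="1.0" encoding="iso-8859-1"?><rss/>';
$enc  = declared_xml_encoding($feed);
echo $enc === null ? 'not declared' : $enc, PHP_EOL;
```

When a name is found, it could be passed as the source encoding to mb_convert_encoding($xml, 'UTF-8', $enc). The declared value is only as trustworthy as the feed's producer, so a validity check on the converted result is still worthwhile.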