How do I detect whether a string needs UTF-8 decoding or encoding?

Posted on 2024-10-06 20:26:59

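I have a feed from a third-party website. Sometimes I have to apply utf8_decode, and other times utf8_encode, to get the desired visible output.

If I apply the same function twice by mistake, or use the wrong one, the output becomes even more garbled, and that is what I want to avoid.

How can I detect which of the two must be applied to a given string?

The content is actually returned as UTF-8, but some parts inside it are not.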

旧瑾黎汐 2024-10-13 20:26:59

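I can't say I can rely on mb_detect_encoding(). I ran into some strange false positives a while back.

The most universal approach I found, one that worked well in every case I tried, was: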

if (preg_match('!!u', $string))
{
   // This is UTF-8
}
else
{
   // Definitely not UTF-8
}
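Since the question mentions a feed that is mostly UTF-8 with some parts that are not, the same validity test can be applied piece by piece. The following is only a minimal sketch under two assumptions that do not come from the original post: the invalid parts fall on line boundaries, and they are encoded as ISO-8859-1 (the encoding utf8_encode assumes). The function name normalize_lines_to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: normalise text that is mostly UTF-8 but contains
// some lines in a legacy encoding (ASSUMED here to be ISO-8859-1).
function normalize_lines_to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        // With the /u modifier, preg_match() returns false (not 0)
        // when the subject is not valid UTF-8.
        if (preg_match('//u', $line) === false) {
            // Only lines that fail validation are converted, so valid
            // UTF-8 lines are never encoded a second time.
            $lines[$i] = mb_convert_encoding($line, 'UTF-8', $fallback);
        }
    }
    return implode("\n", $lines);
}
```

Because only invalid lines are touched, running the function twice on the same input should give the same result as running it once. If the legacy parts are in some other encoding, the fallback argument would have to change accordingly.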

独夜无伴 2024-10-13 20:26:59

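// Undo one layer of accidental double UTF-8 encoding, if there is one.
// Note: utf8_decode() is deprecated as of PHP 8.2.
function str_to_utf8 ($str) {
    // Convert the input from UTF-8 to ISO-8859-1.
    $decoded = utf8_decode($str);
    // If the result is not valid UTF-8, the input was not double-encoded.
    if (mb_detect_encoding($decoded , 'UTF-8', true) === false)
        return $str;
    return $decoded;
}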

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
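One caveat with this approach: utf8_decode replaces any character outside ISO-8859-1 with "?", and a string of "?" is itself valid UTF-8, so correct text such as "€" would come back damaged. The sketch below is a hedged variant, not the answer author's code. It uses mb_convert_encoding instead of utf8_decode and first checks that every code point is in the range U+0000 to U+00FF. The name repair_double_utf8 is hypothetical, and the whole thing remains a heuristic that can misfire on text that really was meant to read "Ã©".

```php
<?php
// Hypothetical helper: undo one layer of double UTF-8 encoding, but only
// when the conversion to ISO-8859-1 cannot lose any character.
function repair_double_utf8(string $str): string
{
    // Require valid UTF-8 made only of code points U+0000..U+00FF.
    // preg_match() yields 0 on no match and false on invalid UTF-8;
    // in both cases the input is returned untouched.
    if (preg_match('/\A[\x{00}-\x{FF}]*\z/u', $str) !== 1) {
        return $str;
    }

    // Map each code point back to a single byte.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');

    // Pure ASCII is unchanged by the conversion; nothing to repair.
    if ($decoded === $str) {
        return $str;
    }

    // Accept the result only if those bytes form valid UTF-8, which
    // suggests the input had been encoded twice.
    return mb_check_encoding($decoded, 'UTF-8') ? $decoded : $str;
}
```

Mojibake that passed through Windows-1252 rather than ISO-8859-1 often contains characters above U+00FF; this sketch deliberately leaves such input alone instead of guessing.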

唠甜嗑 2024-10-13 20:26:59

You can use

mb_detect_encoding: Detect character encoding

The character set may also be available in the HTTP response headers or in the response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
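In the output above, the declared charset sits in the Content-Type header. As a rough sketch of how it could be read programmatically, the hypothetical helper below (charset_from_headers is not a built-in) scans an array of raw header lines such as $http_response_header. The header says what the server claims, which may not match what the body actually contains.

```php
<?php
// Hypothetical helper: extract the declared charset from raw header lines,
// e.g. the $http_response_header array filled in by file_get_contents().
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $header) {
        $pattern = '/^Content-Type:.*?;\s*charset\s*=\s*"?([^\s";]+)/i';
        if (preg_match($pattern, $header, $m) === 1) {
            // Keep the last match: when redirects are followed, the array
            // typically holds the headers of every response in order.
            $charset = strtoupper($m[1]);
        }
    }
    // null means no Content-Type header declared a charset.
    return $charset;
}
```

For the header list shown above, this would return "UTF-8". That value could then be passed to mb_convert_encoding as the source encoding rather than relying on detection alone.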
