PHP UTF-8 encoding problem

Posted on 2024-10-20 09:42:13

Hi all,
I've run into a tricky problem: I need to read some files and convert their contents into XML files. I believe most lines in the files are plain ASCII, so I can read each line into PHP and write it to an XML file that uses the default XML encoding, UTF-8. However, the original files may also contain text in GBK, GB2312 (Chinese), SJIS (Japanese), and so on. PHP has no problem writing those strings into the XML as-is, but the XML parser then detects invalid UTF-8 sequences and crashes.

I think the PHP library function best suited to my purpose is probably:

 $decode_str = mb_convert_encoding($str, 'UTF-8', 'auto');

I tried running this conversion function on each line before inserting it into the XML. However, when I tested it with some UTF-16 and GBK input, it did not seem to identify the encoding of the input string correctly.
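One likely reason is that 'auto' only expands to a short, configuration-dependent list of encodings, and many legacy byte sequences are well-formed under more than one encoding. Below is a minimal sketch of an alternative, assuming the set of possible source encodings is known in advance. The helper name `to_utf8` and the candidate order are hypothetical; `CP936` is the name mbstring uses for GBK.

```php
<?php
// Hypothetical helper: try an explicit, ordered list of candidate encodings
// instead of relying on 'auto'.
function to_utf8(string $line, array $candidates = ['CP936', 'SJIS']): string
{
    // Valid UTF-8 (which includes plain ASCII) is passed through untouched.
    if (mb_check_encoding($line, 'UTF-8')) {
        return $line;
    }
    foreach ($candidates as $encoding) {
        // The first candidate under which the bytes are well-formed wins.
        if (mb_check_encoding($line, $encoding)) {
            return mb_convert_encoding($line, 'UTF-8', $encoding);
        }
    }
    throw new RuntimeException('No candidate encoding matched the input.');
}

$gbk = "\xD6\xD0\xCE\xC4"; // the characters U+4E2D U+6587 encoded as GBK
echo to_utf8($gbk), PHP_EOL;
```

Because the same bytes can be valid in several legacy encodings, the order of the candidate list decides ambiguous cases, so it should reflect what the files are most likely to contain.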

In addition, I tried wrapping the string in CDATA, but oddly the XML parser still complains about invalid UTF-8 bytes. And when I open the XML file in vim, the content inside the CDATA section is garbled.
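This behaviour is expected: CDATA only stops the parser from treating the content as markup, while the bytes inside still have to be valid in the document's declared encoding. The sketch below illustrates this with a hypothetical helper, `cdata`, that refuses input that is not UTF-8 yet. The sample bytes are assumed to be GBK.

```php
<?php
// Hypothetical helper: wrap text in CDATA only after confirming it is UTF-8.
function cdata(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Text must be converted to UTF-8 first.');
    }
    // A literal "]]>" would end the section early, so split it across two sections.
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$raw = "\xD6\xD0\xCE\xC4"; // GBK bytes, not valid UTF-8

// Wrapping alone does not change the bytes, so the validity check still fails.
var_dump(mb_check_encoding('<![CDATA[' . $raw . ']]>', 'UTF-8'));

// Converting first, then wrapping, yields a section the parser can accept.
echo cdata(mb_convert_encoding($raw, 'UTF-8', 'CP936')), PHP_EOL;
```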

Any suggestions?

童话里做英雄 2024-10-27 09:42:13

I once spent a lot of time creating a safe UTF-8 conversion function:

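function _convert($content) {
    // Treat the input as suspect if it is not valid UTF-8, or if it does not
    // survive a UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    if (!mb_check_encoding($content, 'UTF-8')
        || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {

        // No source encoding is passed here, so mbstring assumes its
        // configured internal encoding; it does not detect GBK, SJIS, etc.
        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not be converted to UTF-8');
        }
    }
    return $content;
}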

The main problem was figuring out which encoding the input string already uses. Please let me know whether my solution works for you as well!
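For the UTF-16 input mentioned in the question, one cheap signal is a byte-order mark at the start of the file. The sketch below assumes the file begins with a BOM, which many files do not; `decode_by_bom` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: use a byte-order mark, when present, to pick the
// source encoding and convert the remaining bytes to UTF-8.
function decode_by_bom(string $bytes): ?string
{
    $boms = [
        "\xEF\xBB\xBF" => 'UTF-8',
        "\xFF\xFE"     => 'UTF-16LE',
        "\xFE\xFF"     => 'UTF-16BE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            // Drop the BOM itself, then convert the rest.
            return mb_convert_encoding(substr($bytes, strlen($bom)), 'UTF-8', $encoding);
        }
    }
    return null; // no BOM: the caller has to fall back to another strategy
}

echo decode_by_bom("\xFF\xFEh\x00i\x00"), PHP_EOL;
```

A check like this belongs on the whole file contents, not on individual lines: in UTF-16 a line break is two bytes long, so splitting on a single "\n" byte before decoding would cut characters in half.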

梦里的微风 2024-10-27 09:42:13

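I ran into this problem while using json_encode. I use the function below, which rewrites two- and three-byte UTF-8 sequences as numeric character entities so that the output is pure ASCII. Note that it assumes the input is already UTF-8.
Source: https://www.php.net/manual/en/function.json-encode.php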

function ascii_to_entities($str)
{
    $count = 1;
    $out   = '';
    $temp  = array();

    for ($i = 0, $s = strlen($str); $i < $s; $i++)
    {
        $ordinal = ord($str[$i]);

        if ($ordinal < 128)
        {
            if (count($temp) == 1)
            {
                $out  .= '&#'.array_shift($temp).';';
                $count = 1;
            }

            $out .= $str[$i];
        }
        else
        {
            if (count($temp) == 0)
            {
                $count = ($ordinal < 224) ? 2 : 3;
            }

            $temp[] = $ordinal;

            if (count($temp) == $count)
            {
                $number = ($count == 3)
                    ? (($temp['0'] % 16) * 4096) + (($temp['1'] % 64) * 64) + ($temp['2'] % 64)
                    : (($temp['0'] % 32) * 64) + ($temp['1'] % 64);

                $out .= '&#'.$number.';';
                $count = 1;
                $temp = array();
            }
        }
    }

    return $out;
}
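
As a point of comparison, mbstring's built-in mb_encode_numericentity can produce the same kind of decimal entities, and it also covers four-byte sequences that the loop above does not handle. This is a sketch of a swapped-in alternative, not part of the original answer. The wrapper name `utf8_to_entities` is hypothetical, and the conversion map shown is one possible choice.

```php
<?php
// Convert every code point from U+0080 upward to a decimal entity.
// The map is [start, end, offset, mask]; these values are an assumption
// chosen to cover the full Unicode range.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Hypothetical wrapper around the built-in mbstring function.
function utf8_to_entities(string $utf8, array $convmap): string
{
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

echo utf8_to_entities("caf\u{E9} \u{4E2D}", $convmap), PHP_EOL;
```

Like the hand-written version, this only gives sensible results when the input is already valid UTF-8, so it complements encoding conversion and does not replace it.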
