PHP中如何处理xls文件的不同编码?

发布于 2024-10-24 23:07:01 字数 329 浏览 0 评论 0原文

我正在开发一个涉及从 xls 文件解析数据的 php 脚本。我正在使用库 phpexcelreader。一切大部分都有效,但我偶然发现了一个奇怪的问题。某些文件解析不正确。看起来 xls 文件可能在内部使用不同的字符编码。至少,然后我通过 iconv -f cp1251 -t utf8 从脚本中输出,字符串得到纠正。

Phpexcelreader 有一个用于指定输出编码的选项,但看起来它缺乏检测输入编码的能力。有什么想法吗?

I'm developing a php script involving parsing data from xls files. I'm using library phpexcelreader. All mostly works, but I stumbled upon a strange problem. Some files are parsed incorrecty. Looks like xls files may use different character encodings internally. At least, then I pipe output from my script through iconv -f cp1251 -t utf8, strings get corrected.

Phpexcelreader has an option for specifing output encoding, but looks like it lacks an ability detect input encoding. Any ideas?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

つ低調成傷 2024-10-31 23:07:01

可以将工作簿对象的 _defaultEncoding 属性设置为包含 Excel 文件使用的字符集,然后读者用它来处理到 UTF-16LE 的转换,但它不努力识别内部字符集本身。

如果您

define('SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE',  0x0042);

在其他 SPREADSHEET_EXCEL_READER_TYPE 定义中进行定义,然后修改从第 464 行开始的 switch 语句以包含 SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE 的情况。这种情况的逻辑需要类似于:

$length = $this->_GetInt2d($this->_data, $pos + 2);
$recordData = substr($this->_data, $pos + 4, $length);

// move stream pointer to next record
$pos += 4 + $length;

// offset: 0; size: 2; code page identifier
$codepage = $this->_GetInt2d($recordData, 0);
$codepage = $this->_CodePageNumberToName($codepage)

重新创建 _GetInt2d 方法(似乎在某些时候已从代码中删除)并

function _GetInt2d($data, $pos)
{
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}

创建一个 _CodePageNumberToName 方法以从其数值返回代码页名称:

function _CodePageNumberToName($codePage = '1252')
{
    switch ($codePage) {
        case 367:   return 'ASCII';     break;  //  ASCII
        case 437:   return 'CP437';     break;  //  OEM US
        case 720:   throw new Exception('Code page 720 not supported.');
                                        break;  //  OEM Arabic
        case 737:   return 'CP737';     break;  //  OEM Greek
        case 775:   return 'CP775';     break;  //  OEM Baltic
        case 850:   return 'CP850';     break;  //  OEM Latin I
        case 852:   return 'CP852';     break;  //  OEM Latin II (Central European)
        case 855:   return 'CP855';     break;  //  OEM Cyrillic
        case 857:   return 'CP857';     break;  //  OEM Turkish
        case 858:   return 'CP858';     break;  //  OEM Multilingual Latin I with Euro
        case 860:   return 'CP860';     break;  //  OEM Portugese
        case 861:   return 'CP861';     break;  //  OEM Icelandic
        case 862:   return 'CP862';     break;  //  OEM Hebrew
        case 863:   return 'CP863';     break;  //  OEM Canadian (French)
        case 864:   return 'CP864';     break;  //  OEM Arabic
        case 865:   return 'CP865';     break;  //  OEM Nordic
        case 866:   return 'CP866';     break;  //  OEM Cyrillic (Russian)
        case 869:   return 'CP869';     break;  //  OEM Greek (Modern)
        case 874:   return 'CP874';     break;  //  ANSI Thai
        case 932:   return 'CP932';     break;  //  ANSI Japanese Shift-JIS
        case 936:   return 'CP936';     break;  //  ANSI Chinese Simplified GBK
        case 949:   return 'CP949';     break;  //  ANSI Korean (Wansung)
        case 950:   return 'CP950';     break;  //  ANSI Chinese Traditional BIG5
        case 1200:  return 'UTF-16LE';  break;  //  UTF-16 (BIFF8)
        case 1250:  return 'CP1250';    break;  //  ANSI Latin II (Central European)
        case 1251:  return 'CP1251';    break;  //  ANSI Cyrillic
        case 0:                                 //  CodePage is not always correctly set when the xls file was saved by Apple's Numbers program
        case 1252:  return 'CP1252';    break;  //  ANSI Latin I (BIFF4-BIFF7)
        case 1253:  return 'CP1253';    break;  //  ANSI Greek
        case 1254:  return 'CP1254';    break;  //  ANSI Turkish
        case 1255:  return 'CP1255';    break;  //  ANSI Hebrew
        case 1256:  return 'CP1256';    break;  //  ANSI Arabic
        case 1257:  return 'CP1257';    break;  //  ANSI Baltic
        case 1258:  return 'CP1258';    break;  //  ANSI Vietnamese
        case 1361:  return 'CP1361';    break;  //  ANSI Korean (Johab)
        case 10000: return 'MAC';       break;  //  Apple Roman
        case 32768: return 'MAC';       break;  //  Apple Roman
        case 32769: throw new Exception('Code page 32769 not supported.');
                                        break;  //  ANSI Latin I (BIFF2-BIFF3)
        case 65001: return 'UTF-8';     break;  //  Unicode (UTF-8)
    }
}

并存储返回的值 器

或者,切换到可以首先正确处理代码页的 Excel 阅读

The _defaultEncoding property of the workbook object can be set to contain the charset used by the Excel file, and this is then used to handle conversion to UTF-16LE by the reader, but it makes no effort to identify the internal charset itself.

If you define

define('SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE',  0x0042);

among the other SPREADSHEET_EXCEL_READER_TYPE definitions, and then modify the switch statement starting at line 464 to include a case for SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE. The logic for this case needs to be something like:

$length = $this->_GetInt2d($this->_data, $pos + 2);
$recordData = substr($this->_data, $pos + 4, $length);

// move stream pointer to next record
$pos += 4 + $length;

// offset: 0; size: 2; code page identifier
$codepage = $this->_GetInt2d($recordData, 0);
$codepage = $this->_CodePageNumberToName($codepage)

Recreate the _GetInt2d method (that seems to have been stripped from the code at some point) as

function _GetInt2d($data, $pos)
{
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}

and create a _CodePageNumberToName method to return the codepage name from its numeric value:

function _CodePageNumberToName($codePage = '1252')
{
    switch ($codePage) {
        case 367:   return 'ASCII';     break;  //  ASCII
        case 437:   return 'CP437';     break;  //  OEM US
        case 720:   throw new Exception('Code page 720 not supported.');
                                        break;  //  OEM Arabic
        case 737:   return 'CP737';     break;  //  OEM Greek
        case 775:   return 'CP775';     break;  //  OEM Baltic
        case 850:   return 'CP850';     break;  //  OEM Latin I
        case 852:   return 'CP852';     break;  //  OEM Latin II (Central European)
        case 855:   return 'CP855';     break;  //  OEM Cyrillic
        case 857:   return 'CP857';     break;  //  OEM Turkish
        case 858:   return 'CP858';     break;  //  OEM Multilingual Latin I with Euro
        case 860:   return 'CP860';     break;  //  OEM Portugese
        case 861:   return 'CP861';     break;  //  OEM Icelandic
        case 862:   return 'CP862';     break;  //  OEM Hebrew
        case 863:   return 'CP863';     break;  //  OEM Canadian (French)
        case 864:   return 'CP864';     break;  //  OEM Arabic
        case 865:   return 'CP865';     break;  //  OEM Nordic
        case 866:   return 'CP866';     break;  //  OEM Cyrillic (Russian)
        case 869:   return 'CP869';     break;  //  OEM Greek (Modern)
        case 874:   return 'CP874';     break;  //  ANSI Thai
        case 932:   return 'CP932';     break;  //  ANSI Japanese Shift-JIS
        case 936:   return 'CP936';     break;  //  ANSI Chinese Simplified GBK
        case 949:   return 'CP949';     break;  //  ANSI Korean (Wansung)
        case 950:   return 'CP950';     break;  //  ANSI Chinese Traditional BIG5
        case 1200:  return 'UTF-16LE';  break;  //  UTF-16 (BIFF8)
        case 1250:  return 'CP1250';    break;  //  ANSI Latin II (Central European)
        case 1251:  return 'CP1251';    break;  //  ANSI Cyrillic
        case 0:                                 //  CodePage is not always correctly set when the xls file was saved by Apple's Numbers program
        case 1252:  return 'CP1252';    break;  //  ANSI Latin I (BIFF4-BIFF7)
        case 1253:  return 'CP1253';    break;  //  ANSI Greek
        case 1254:  return 'CP1254';    break;  //  ANSI Turkish
        case 1255:  return 'CP1255';    break;  //  ANSI Hebrew
        case 1256:  return 'CP1256';    break;  //  ANSI Arabic
        case 1257:  return 'CP1257';    break;  //  ANSI Baltic
        case 1258:  return 'CP1258';    break;  //  ANSI Vietnamese
        case 1361:  return 'CP1361';    break;  //  ANSI Korean (Johab)
        case 10000: return 'MAC';       break;  //  Apple Roman
        case 32768: return 'MAC';       break;  //  Apple Roman
        case 32769: throw new Exception('Code page 32769 not supported.');
                                        break;  //  ANSI Latin I (BIFF2-BIFF3)
        case 65001: return 'UTF-8';     break;  //  Unicode (UTF-8)
    }
}

And store the returned value in $_defaultEncoding

Alternatively, switch to an Excel reader that can handle the codepage correctly in the first place

星軌x 2024-10-31 23:07:01

我的2美分:

我刚刚用这个替换了encodeUTF16

 function _encodeUTF16($string, $check = false) {
    if ($check) {
        $from = api_detect_encoding($string);
        $string = api_convert_encoding($string, $this->_defaultEncoding, $from);
        return $string;    
    }
    $string =  api_convert_encoding($string, $this->_defaultEncoding, 'UTF-16LE');
    return $string;

,并更改了第568行,

$retstr = ($asciiEncoding) ? $this->_encodeUTF16($retstr, true) : $this->_encodeUTF16($retstr);

这些函数 api_detect_encoding 和 api_convert_encoding 可以在此库中找到:

http://code.google.com/p/chamilo/source/browse/main/inc/lib/internationalization。 lib.php?repo=经典

My 2 cents:

I just replaced the encodeUTF16 with this one

 function _encodeUTF16($string, $check = false) {
    if ($check) {
        $from = api_detect_encoding($string);
        $string = api_convert_encoding($string, $this->_defaultEncoding, $from);
        return $string;    
    }
    $string =  api_convert_encoding($string, $this->_defaultEncoding, 'UTF-16LE');
    return $string;

and change the line 568

$retstr = ($asciiEncoding) ? $this->_encodeUTF16($retstr, true) : $this->_encodeUTF16($retstr);

these functions api_detect_encoding and api_convert_encoding can be found in this lib:

http://code.google.com/p/chamilo/source/browse/main/inc/lib/internationalization.lib.php?repo=classic

童话 2024-10-31 23:07:01

对于波斯语,我在第 568 行或其他版本 336 后添加了一行。

$retstr = ($asciiEncoding) ? $retstr : $this->_encodeUTF16($retstr);

$retstr=iconv("UTF-16LE","UTF-8", $retstr);

此代码支持波斯语,但你不能再使用英语了。

for Persian language i added one line after the line 568 or in other version 336.

$retstr = ($asciiEncoding) ? $retstr : $this->_encodeUTF16($retstr);

$retstr=iconv("UTF-16LE","UTF-8", $retstr);

this code support Persian but u can't use English any more.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文