fgetcsv() drops characters with diacritics (i.e. non-ASCII) - how to fix?

Posted on 2024-09-17 18:54:05

Similar questions:
Some characters in CSV file are not read during PHP fgetcsv(),
fgetcsv() ignores special characters when they are at the beginning of line

My application has a form where users can upload a CSV file. Its 5 internal users have always uploaded valid files (comma-delimited, quoted, records terminated by LF). The file is then imported into a database using PHP:

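$fhandle = fopen($uploaded_file, 'r');
// compare against false explicitly so only end-of-file or an error ends the loop
while (($row = fgetcsv($fhandle, 0, ',', '"', '\\')) !== false) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}
fclose($fhandle);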

For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset - a single-byte, 8-bit character encoding.

The problem: some (not all!) characters beyond 127 ("extended ASCII") are dropped by fgetcsv(). Example data:

"15","Ústav"
"420","Špičák"
"7","Tmaň"

becomes

Array (
  0 => 15
  1 => "stav"
)
Array (
  0 => 420
  1 => "pičák"
)
Array (
  0 => 7
  1 => "Tma"
)

(Note that č is kept, but Ú is dropped)

The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but it looks like it isn't. Am I doing something wrong, or is this function broken, so that I should look for a different way to parse CSV?

Comments (1)

九公里浅绿 2024-09-24 18:54:05

It turns out that I didn't read the documentation well enough: fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII (byte values up to 127), but the documentation also says:

Note:

Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function

In other words, fgetcsv() tries to be binary-safe but actually isn't, because it also interprets the bytes according to the locale's character set. It will probably mangle the data it reads, and the cause is easy to miss because this setting is not configured in php.ini but read from $LANG.
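To see whether the documentation note applies on a given server, it may help to print the active LC_CTYPE and confirm that the uploaded bytes are not valid UTF-8. The following is only a diagnostic sketch: `is_valid_utf8` is a hypothetical helper name, and the byte strings are the question's sample values written out by hand as Windows-1250 bytes.

```php
<?php
// Hypothetical helper: preg_match() with the /u modifier fails on input
// that is not valid UTF-8, which makes it a cheap validity check.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Passing '0' asks setlocale() for the current value without changing it.
echo 'LC_CTYPE: ', setlocale(LC_CTYPE, '0'), "\n";

// The question's sample fields as Windows-1250 bytes
// (0xDA = Ú, 0x8A = Š, 0xE8 = č, 0xE1 = á, 0xF2 = ň).
$samples = ["\xDAstav", "\x8Api\xE8\xE1k", "Tma\xF2"];

foreach ($samples as $bytes) {
    // bin2hex() shows the raw bytes regardless of the terminal's encoding.
    printf("%s valid UTF-8: %s\n", bin2hex($bytes), is_valid_utf8($bytes) ? 'yes' : 'no');
}
```

If LC_CTYPE names a UTF-8 locale and the sample fields are reported as invalid UTF-8, the locale note quoted above is a plausible explanation for the dropped characters.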

I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from the comment in the docs to parse them into an array:

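$fhandle = fopen($uploaded_file, 'r');
// fgets is actually binary safe; compare against false so a line such as "0" cannot end the loop early
while (($raw_row = fgets($fhandle)) !== false) {
    // csvstring_to_array() is the user-contributed function from the docs comments
    $row = csvstring_to_array($raw_row, ',', '"', "\n");
    // $row is now read correctly
}
fclose($fhandle);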
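The csvstring_to_array() helper is not reproduced in this thread. As a stand-in, here is a minimal sketch of the same idea: split each line byte by byte so the locale is never consulted, then convert the fields to UTF-8 with iconv(). The function names `parse_csv_line_bytes`, `cp1250_to_utf8` and `read_cp1250_csv` are hypothetical, and the sketch assumes the format described in the question (comma-delimited, double-quoted, one record per LF-terminated line, quotes escaped by doubling). It does not handle quoted fields that span several lines.

```php
<?php
// Hypothetical helper (not the function from the docs comment): splits one
// CSV record into fields by walking the string byte by byte, so the result
// does not depend on the current locale.
function parse_csv_line_bytes(string $line, string $delimiter = ',', string $enclosure = '"'): array
{
    $line = rtrim($line, "\r\n");
    $fields = [];
    $field = '';
    $inQuotes = false;
    $length = strlen($line);

    for ($i = 0; $i < $length; $i++) {
        $byte = $line[$i];
        if ($inQuotes) {
            if ($byte !== $enclosure) {
                $field .= $byte;
            } elseif ($i + 1 < $length && $line[$i + 1] === $enclosure) {
                $field .= $enclosure; // doubled quote stands for a literal quote
                $i++;
            } else {
                $inQuotes = false;    // closing quote
            }
        } elseif ($byte === $enclosure) {
            $inQuotes = true;         // opening quote
        } elseif ($byte === $delimiter) {
            $fields[] = $field;       // field boundary
            $field = '';
        } else {
            $field .= $byte;
        }
    }
    $fields[] = $field;               // last field has no trailing delimiter
    return $fields;
}

// Converts one Windows-1250 string to UTF-8 and fails loudly on bad input.
function cp1250_to_utf8(string $bytes): string
{
    $converted = iconv('WINDOWS-1250', 'UTF-8', $bytes);
    if ($converted === false) {
        throw new UnexpectedValueException('Input is not valid Windows-1250');
    }
    return $converted;
}

// Reads a Windows-1250 CSV file and returns its rows as UTF-8 strings.
function read_cp1250_csv(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }
    $rows = [];
    while (($line = fgets($handle)) !== false) {
        if (rtrim($line, "\r\n") === '') {
            continue; // skip blank lines
        }
        $rows[] = array_map('cp1250_to_utf8', parse_csv_line_bytes($line));
    }
    fclose($handle);
    return $rows;
}
```

Calling `read_cp1250_csv($uploaded_file)` would return one array per record. Converting to UTF-8 only after the fields have been split keeps the parsing step purely byte-based, which is the property the fgets() workaround relies on.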