fgetcsv() is dropping characters with diacritics (i.e. non-ASCII) - how to fix?
Similar questions:
Some characters in CSV file are not read during PHP fgetcsv(),
fgetcsv() ignores special characters when they are at the beginning of line
My application has a form where users can upload a CSV file (its 5 internal users have always uploaded a valid file: comma-delimited, quoted, records ending with LF), and the file is then imported into a database using PHP:
$fhandle = fopen($uploaded_file, 'r');
while ($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}
For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset - a single-byte, 8-bit character encoding.
The problem: some (not all!) characters beyond 127 ("extended ASCII") are dropped in fgetcsv(). Example data:
"15","Ústav"
"420","Špičák"
"7","Tmaň"
becomes
Array (
0 => 15
1 => "stav"
)
Array (
0 => 420
1 => "pičák"
)
Array (
0 => 7
1 => "Tma"
)
(Note that č is kept, but Ú is dropped)
The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but it looks like it isn't. Am I doing something wrong, or is this function broken, so that I should look for a different way to parse CSV?
Comments (1)
It turns out that I didn't read the documentation well enough - fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII (bytes 0-127), but the documentation also says:
"Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function."
In other words, fgetcsv() tries to be binary-safe, but it's actually not (because it's also messing with the charset at the same time), and it will probably mangle the data it reads (as this setting is not configured in php.ini, but rather read from $LANG). I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from the comment in the docs to parse them into an array:
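The function from the docs comment is not reproduced here, so what follows is only a minimal sketch of the same idea, not the original code. `parse_csv_line()` is a hypothetical helper that splits one record byte by byte and never consults the locale. It assumes PHP 7.0+ and that quoted fields contain no line breaks, which holds for the files described in the question. The conversion step assumes the iconv extension is available and knows the charset name `WINDOWS-1250`.

```php
<?php
// Hypothetical helper (not the function from the docs comments): splits one
// CSV record byte by byte, so the locale never comes into play.
// Limitation: a quoted field may not contain a line break, because the
// caller reads the file one line at a time with fgets().
function parse_csv_line(string $line, string $delimiter = ',', string $enclosure = '"'): array
{
    $line = rtrim($line, "\r\n");
    $fields = [];
    $field = '';
    $inQuotes = false;
    $len = strlen($line); // length in bytes, not characters

    for ($i = 0; $i < $len; $i++) {
        $ch = $line[$i];
        if ($inQuotes) {
            if ($ch !== $enclosure) {
                $field .= $ch;
            } elseif ($i + 1 < $len && $line[$i + 1] === $enclosure) {
                $field .= $enclosure; // doubled quote is a literal quote
                $i++;
            } else {
                $inQuotes = false;    // closing quote
            }
        } elseif ($ch === $enclosure) {
            $inQuotes = true;         // opening quote
        } elseif ($ch === $delimiter) {
            $fields[] = $field;       // end of field
            $field = '';
        } else {
            $field .= $ch;
        }
    }
    $fields[] = $field;               // last field of the record
    return $fields;
}

// Demo data: the three sample rows, written as raw Windows-1250 bytes.
$uploaded_file = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents(
    $uploaded_file,
    "\"15\",\"\xDAstav\"\n\"420\",\"\x8Api\xE8\xE1k\"\n\"7\",\"Tma\xF2\"\n"
);

$fhandle = fopen($uploaded_file, 'r');
while (($line = fgets($fhandle)) !== false) {
    if (trim($line) === '') {
        continue; // skip blank lines
    }
    $row = parse_csv_line($line);
    // Optional step: convert each field to UTF-8 before storing or printing.
    foreach ($row as $i => $value) {
        $row[$i] = iconv('WINDOWS-1250', 'UTF-8', $value);
    }
    print_r($row);
}
fclose($fhandle);
unlink($uploaded_file);
```

Because the parser only compares single bytes against the delimiter and the quote character, bytes above 127 pass through untouched whatever the locale is. To see which locale is in effect, setlocale(LC_CTYPE, '0') returns the current setting without changing it.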