PHP: cannot find a way to split a utf-8 string
I just started dabbling in PHP and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.
I'm working on Ubuntu 11.10 x86, PHP version 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim's :set encoding confirms this), which I read using
$file = fopen("file.txt", "r");
while (!feof($file)) {
    $line = fgets($file);
    //...
}
fclose($file);
- mb_detect_encoding($line) reports UTF-8
- If I do echo $line, I can see the line properly (no mangled characters) in the browser, so I guess everything is fine with the browser and Apache. I did, however, search my Apache configuration for AddDefaultCharset and tried adding HTTP meta tags for the character encoding (just in case).
When I try to split the string using $arr = mb_split(';',$line), the fields of the resulting array contain mangled UTF-8 characters (mb_detect_encoding($arr[0]) reports UTF-8 as well).
So echo $arr[0] results in something like this: ΑΘΗÎÎ
I have tried setting mb_detect_order('utf-8') and mb_internal_encoding('utf-8'), but nothing changed. I also tried to detect UTF-8 manually using this W3 Perl regex, because I read somewhere that mb_detect_encoding can sometimes fail (myth?), but the results were the same.
So my question is: how can I properly split the string? Is going down the mb_ path the wrong way? What am I missing?
Thank you for your help!
UPDATE: I'm adding sample strings and their base64 equivalents (thanks to @chris for the suggestion)
1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ΑΘΗÎΑ"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="
OK, so after doing this there seems to be a 77u/ difference between 3. and 5., which according to this is a UTF-8 BOM. So how can I avoid it?
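To make the comparison between items 3 and 5 concrete, here is a minimal sketch that checks whether the extra 77u/ prefix really is the UTF-8 BOM. It only uses the base64 strings quoted above; the variable names are illustrative.

```php
<?php
// The UTF-8 byte order mark is the three bytes EF BB BF.
$bom = "\xEF\xBB\xBF";

// Item 5 from the list above: $arr[0] after splitting, as raw bytes.
$field = base64_decode('77u/zpHOmM6Xzp3OkQ==');

// Item 3 from the list above: the bytes the first field was expected to have.
$expected = base64_decode('zpHOmM6Xzp3OkQ==');

var_dump(base64_encode($bom));              // string(4) "77u/"
var_dump(substr($field, 0, 3) === $bom);    // bool(true)
var_dump(substr($field, 3) === $expected);  // bool(true)
```

Because the BOM is exactly three bytes long, it encodes to a clean four-character base64 group, which is why it shows up as a separate 77u/ prefix.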
UPDATE 2: I woke up refreshed today and, with your tips in mind, tried again. It seems that $line=fgets($file) reads the first line correctly (no mangled chars) and fails for each subsequent line. So I base64-encoded the first and second lines, and the 77u/ BOM appeared in the base64 string of the first line only. I then opened the offending file in vim and entered :set nobomb followed by :w to save the file without the BOM. Firing up PHP again showed that the first line was now mangled as well. Based on @hakre's remove_utf8_bom, I added its complementary function
function add_utf8_bom($str) {
    $bom = "\xEF\xBB\xBF";
    return substr($str, 0, 3) === $bom ? $str : $bom . $str;
}
and voilà, each line is read correctly now.
I don't much like this solution, as it seems very hackish (I can't believe that an entire framework/language does not provide a way to deal with nobomb'ed strings). So do you know of an alternative approach? Otherwise I'll proceed with the above.
Thanks to @chris, @hakre and @jacob for their time!
UPDATE 3 (solution): It turns out it was a browser thing after all: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta tags like <meta http-equiv="Content-type" content="text/html; charset=UTF-8" />. The output also had to be properly enclosed in an <html><body> section, or the browser would not understand the encoding correctly. Thanks to @jake for his suggestion.
Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience, everyone.
UTF-8 has the very nice feature that it is ASCII-compatible. By this I mean that ASCII characters are encoded as the same single bytes in UTF-8, and that the bytes of multi-byte characters never fall in the ASCII range.
This means that when you split a UTF-8 string by the semicolon character ;, which is an ASCII character, you can just use the standard single-byte string functions. In your example, you can just use explode(';',$utf8encodedText) and everything should work as expected.
PS: Since the UTF-8 encoding is prefix-free, you can actually use explode() with any UTF-8 encoded separator.
PPS: It seems like you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for separators, quotes, etc.
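As a minimal sketch of the explode() suggestion, the snippet below splits the sample line from the question with the plain byte-oriented function. The line is rebuilt from the base64 string quoted in the question so that the example does not depend on how this file itself is saved; the variable names are illustrative.

```php
<?php
// Sample line from the question ("ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"),
// decoded from the base64 string given there.
$line = base64_decode('zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5');

// ';' is a single ASCII byte that never occurs inside a multi-byte
// UTF-8 sequence, so a byte-wise split cannot cut a character in half.
$fields = explode(';', $line);

var_dump(count($fields)); // int(5)
print_r($fields);
```

Note that this does nothing about a BOM at the start of the file: if one is present, it stays attached to the first field.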
When you write debug/testing scripts in PHP, make sure you output a more or less valid HTML page.
I like to use a PHP file similar to the following:
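A minimal sketch of what such a file might look like. This is not the answerer's original file; the helper name debug_page and the page layout are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): wraps debug text in a
// minimal HTML document that declares UTF-8 in a meta tag.
function debug_page($bodyText)
{
    // Escape the text so raw debug data cannot break the markup.
    $escaped = htmlspecialchars($bodyText, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"
        . "<title>Debug output</title>\n"
        . "</head>\n<body>\n<pre>" . $escaped . "</pre>\n</body>\n</html>\n";
}

// Declare the encoding in the HTTP header as well, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo debug_page('ΑΘΗΝΑ');
```

Declaring the charset both in the header and in the markup is redundant on purpose: the page still announces its encoding if it is saved to disk or served without the header.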
If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
Edit: I just read your post more closely. You're suggesting this should output false, because you're suggesting a BOM was introduced by mb_split().
Does it? It works as expected for me (bool true, and the strings in the array are correct).
The mb_split function should be fine, but you should also define the charset it uses with mb_regex_encoding.
About mb_detect_encoding: it can fail, but that's simply because an encoding can never be detected with certainty. You either know it, or you can try to guess, but that's all. Encoding detection is mostly a gambling game; however, you can use the strict parameter of that function and specify the encoding(s) you're looking for.
How to remove the BOM:
You can filter the string input and remove a UTF-8 bom with a small helper function:
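A minimal sketch of such a helper. The name remove_utf8_bom is the one mentioned in the question; the body shown here is an assumption, not necessarily the original implementation.

```php
<?php
// Strips a leading UTF-8 BOM (bytes EF BB BF) from a string, if present.
function remove_utf8_bom($str)
{
    $bom = "\xEF\xBB\xBF";

    if (substr($str, 0, 3) === $bom) {
        // The cast guards against older PHP versions, where substr() could
        // return false when the string consists of the BOM alone.
        return (string) substr($str, 3);
    }

    return $str;
}

// Illustrative call: a BOM can only sit at the very start of a file,
// so only the first line read needs this treatment.
$firstLine = remove_utf8_bom("\xEF\xBB\xBF" . "12242;37.99452;23.6889");
var_dump($firstLine); // string(22) "12242;37.99452;23.6889"
```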
Usage:
There are probably better ways to do it, but this should work.
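For the mb_regex_encoding and strict-detection suggestions earlier in this answer, a minimal sketch could look like the following. It assumes the mbstring extension is built with regex support and reuses the sample line from the question via its base64 form; the variable names are illustrative.

```php
<?php
// Tell the mb_* regex functions (mb_split, mb_ereg, ...) which encoding
// to assume for both the pattern and the subject string.
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');

// Sample line from the question, decoded from its base64 form.
$line = base64_decode('zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5');

$fields = mb_split(';', $line);

// Strict detection against one expected encoding: the result is 'UTF-8'
// only if the bytes are valid UTF-8, and false otherwise.
$isUtf8 = mb_detect_encoding($line, 'UTF-8', true) === 'UTF-8';

var_dump(count($fields)); // int(5)
var_dump($isUtf8);        // bool(true)
```

When the expected encoding is already known, mb_check_encoding($line, 'UTF-8') is a more direct way to ask the same yes/no question.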