PHP: can't find a way to split a UTF-8 string

Posted 2024-12-19 09:15:26 · 2865 characters · 7 views · 0 comments

I just started dabbling in PHP, and I'm afraid I need some help figuring out how to manipulate UTF-8 strings.

I'm working on Ubuntu 11.10 x86 with PHP 5.3.6-13ubuntu3.2. I have a UTF-8 encoded file (vim's :set encoding confirms this), which I read using

$file = fopen("file.txt", "r");
while(!feof($file)){
    $line = fgets($file);
    //...
}
fclose($file);

mb_detect_encoding($line) reports UTF-8. If I echo $line, I can see the line properly (no mangled characters) in the browser, so I guess everything is fine with the browser and Apache. I did, though, search my Apache configuration for AddDefaultCharset and tried adding http-equiv meta tags for the character encoding (just in case).

When I try to split the string using $arr = mb_split(';', $line), the fields of the resulting array contain mangled UTF-8 characters (mb_detect_encoding($arr[0]) reports UTF-8 as well).

So echo $arr[0] will result in something like this: ï»¿Î‘Î˜Î—ÎÎ.

I have tried setting mb_detect_order('utf-8') and mb_internal_encoding('utf-8'), but nothing changed. I also tried to detect UTF-8 manually using this W3 Perl regex, because I read somewhere that mb_detect_encoding can sometimes fail (myth?), but the results were the same.

So my question is: how can I split the string properly? Is going down the mb_ path the wrong way? What am I missing?

Thank you for your help!

UPDATE: I'm adding sample strings and their base64 equivalents (thanks to @chris for the suggestion)

1. original string: "ΑΘΗΝΑ;ΑΙΓΑΛΕΩ;12242;37.99452;23.6889"
2. base64 encoded: "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5"
3. first part (the equivalent of "ΑΘΗΝΑ") base64 encoded before splitting: "zpHOmM6Xzp3OkQ=="
4. first part ($arr[0] after splitting): "ï»¿Î‘Î˜Î—ÎÎ‘"
5. first part after splitting base64 encoded: "77u/zpHOmM6Xzp3OkQ=="

OK, so after doing this there seems to be a 77u/ difference between 3. and 5., which according to this is a UTF-8 BOM. So how can I avoid it?
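The `77u/` prefix can be checked directly. What follows is a minimal sketch rather than part of the original post; it assumes the mbstring extension is loaded, and the variable names are only illustrative:

```php
<?php
// The UTF-8 byte order mark (BOM) is the three bytes EF BB BF.
$bom = "\xEF\xBB\xBF";

// Base64 encodes 3-byte groups, so a leading BOM always appears as "77u/".
echo base64_encode($bom), "\n";            // 77u/

// String 3 from the list above: the first field's bytes without a BOM.
$plain = base64_decode('zpHOmM6Xzp3OkQ==');

// Prepending the BOM reproduces string 5.
echo base64_encode($bom . $plain), "\n";   // 77u/zpHOmM6Xzp3OkQ==

// Interpreted as ISO-8859-1 instead of UTF-8, the BOM bytes render as the
// three characters seen at the start of the mangled output.
echo mb_convert_encoding($bom, 'UTF-8', 'ISO-8859-1'), "\n";
```

Because the BOM is exactly three bytes, it fills one base64 group on its own, which is why the rest of string 5 is identical to string 3.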

UPDATE 2: I woke up refreshed today and, with your tips in mind, tried again. It seems that $line = fgets($file) reads the first line correctly (no mangled chars) and fails for each subsequent line. So I base64-encoded the first and second lines, and the 77u/ BOM appeared only in the base64 string of the first line. I then opened the offending file in vim and entered :set nobomb and :w to save the file without the BOM. Firing up PHP again showed that the first line was now mangled too. Based on @hakre's remove_utf8_bom, I added its complementary function

function add_utf8_bom($str) {
    $bom = "\xEF\xBB\xBF";
    // Prepend the BOM only when the string does not already start with it.
    return substr($str, 0, 3) === $bom ? $str : $bom . $str;
}

and voilà, each line is read correctly now.

I don't much like this solution, as it seems very, very hackish (I can't believe that an entire framework/language does not provide a way to deal with strings that lack a BOM). So do you know of an alternative approach? Otherwise I'll proceed with the above.

Thanks to @chris, @hakre and @jacob for their time!

UPDATE 3 (solution): It turns out after all that it was a browser thing: it was not enough to add header('Content-type: text/html; charset=UTF-8') and meta-tags like <meta http-equiv="Content-type" value="text/html; charset=UTF-8" />. It also had to be properly enclosed inside an <html><body> section or the browser would not understand the encoding correctly. Thanks to @jake for his suggestion.

Moral of the story: I should learn more about HTML before trying to code for the browser in the first place. Thanks for your help and patience, everyone.

人心善变 2024-12-26 09:15:26

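UTF-8 has the very nice property of being ASCII-compatible. By that I mean:

ASCII characters stay the same when encoded as UTF-8,
no other character is encoded using ASCII bytes.

This means that when you split a UTF-8 string on the semicolon character ;, which is an ASCII character, you can use the standard single-byte string functions.

In your example, you can simply use explode(';', $utf8encodedText) and everything should work as expected.

PS: Since the UTF-8 encoding is prefix-free, you can actually use explode() with any UTF-8 encoded delimiter.

PPS: It looks like you are trying to parse a CSV file. Have a look at the fgetcsv() function. It should work perfectly on UTF-8 encoded strings as long as you use ASCII characters for the separator, quotes, etc.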
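To make this concrete, here is a small sketch using the sample line from the question. It is an illustration, not the answerer's code: the in-memory stream stands in for the real file, and the delimiter, enclosure and escape arguments passed to fgetcsv() are assumptions about the data.

```php
<?php
// Sample line from the question, rebuilt from its base64 form so that the
// encoding of this script file cannot interfere.
$line = base64_decode('zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5');

// Byte-wise split on the ASCII semicolon; multi-byte characters stay intact.
$fields = explode(';', $line);
var_dump(count($fields));             // int(5)
var_dump(base64_encode($fields[0]));  // string(16) "zpHOmM6Xzp3OkQ=="

// The same line parsed as CSV from an in-memory stream.
$fh = fopen('php://memory', 'w+');
fwrite($fh, $line . "\n");
rewind($fh);
$row = fgetcsv($fh, 0, ';', '"', '\\');
fclose($fh);
print_r($row);
```

Note that fgetcsv() takes the locale settings into account, so it is worth comparing its result with the explode() result on your own system before relying on it.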

岁吢 2024-12-26 09:15:26

When you write debug/testing scripts in php, make sure you output a more or less valid HTML page.

I like to use a PHP file similar to the following:

<!DOCTYPE html>
<html>
  <head>
    <meta charset=utf-8>
    <title>Test page for project XY</title>
  </head>
  <body>
     <h1>Test Page</h1>
     <pre><?php
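        // Escape the dump so request values cannot inject markup into the page.
        echo htmlspecialchars(print_r($_GET, true), ENT_QUOTES, 'UTF-8');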
     ?></pre>
  </body>
</html>

If you don't include any HTML tags, the browser might interpret the file as a text file and all kinds of weird things could happen. In your case, I assume the browser interpreted the file as a Latin1 encoded text file. I assume it worked with the BOM, because whenever the BOM was present, the browser recognized the file as a UTF-8 file.
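The same idea can be packaged as a small function. The sketch below is only one possible shape: `render_debug_page` is a hypothetical helper name, not something from this answer, and it assumes the data you want to inspect can be dumped with print_r().

```php
<?php
/**
 * Hypothetical helper: wraps debug output in a minimal HTML5 page
 * that declares UTF-8.
 */
function render_debug_page($title, $data)
{
    // Escape everything so debug values cannot break the markup.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $safeBody  = htmlspecialchars(print_r($data, true), ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n<meta charset=\"utf-8\">\n"
        . "<title>{$safeTitle}</title>\n</head>\n"
        . "<body>\n<pre>{$safeBody}</pre>\n</body>\n</html>\n";
}

// Declare the charset in the HTTP header as well; skip this on the command
// line or when output has already started.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Example data: the first field from the question, rebuilt from base64.
echo render_debug_page('Test page', array('city' => base64_decode('zpHOmM6Xzp3OkQ==')));
```

Declaring the charset in both the HTTP header and the meta tag means the browser does not have to guess, whether or not the data starts with a BOM.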

人事已非 2024-12-26 09:15:26

Edit: I just read your post more closely. You're suggesting this should output false, because you're suggesting a BOM was introduced by mb_split().

header('content-type: text/plain;charset=utf-8');
$s = "zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5";
$str = base64_decode($s);

$pieces = mb_split(';', $str);

// The first field is 5 Greek letters, i.e. 10 bytes in UTF-8.
var_dump(substr($str, 0, 10) === $pieces[0]);
var_dump($pieces);

Does it? It works as expected for me (bool(true), and the strings in the array are correct).
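The complementary check is to feed mb_split() a line that already carries a BOM. The following is a rough sketch that assumes the mbstring extension with regex support is available; the BOM is prepended by hand to simulate the first line of a file saved with one.

```php
<?php
mb_regex_encoding('UTF-8');

$bom  = "\xEF\xBB\xBF";
$text = base64_decode('zpHOmM6Xzp3OkTvOkc6ZzpPOkc6bzpXOqTsxMjI0MjszNy45OTQ1MjsyMy42ODg5');

// Simulate the first line of a file that was saved with a BOM.
$pieces = mb_split(';', $bom . $text);

// The first field still starts with the BOM: it came from the input.
var_dump(substr($pieces[0], 0, 3) === $bom);   // bool(true)

// Without a BOM in the input, the first field is exactly the first 10 bytes.
$clean = mb_split(';', $text);
var_dump($clean[0] === substr($text, 0, 10));  // bool(true)
```

In other words, a BOM in $arr[0] points at the data being read, not at the split function.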

无敌元气妹 2024-12-26 09:15:26

The mb_split function should be fine, but you should also define the charset it uses with mb_regex_encoding:

mb_regex_encoding('UTF-8');

About mb_detect_encoding: it can fail, but that is simply because an encoding can never be detected with certainty. You either know it or you can guess, but that's all. Encoding detection is mostly a gambling game; however, you can use the strict parameter of that function and specify the encoding(s) you're looking for.

How to remove the BOM:
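As a rough illustration of the strict parameter (a sketch assuming the mbstring extension; the two test strings are made up for the example):

```php
<?php
// UTF-8 bytes of the first field from the question, rebuilt from base64.
$valid = base64_decode('zpHOmM6Xzp3OkQ==');

// Not valid UTF-8: 0xC3 must be followed by a continuation byte.
$invalid = "\xC3\x28";

// Strict mode with an explicit candidate list returns the name or false.
var_dump(mb_detect_encoding($valid, 'UTF-8', true));    // string(5) "UTF-8"
var_dump(mb_detect_encoding($invalid, 'UTF-8', true));  // bool(false)

// When the expected encoding is known, validating is more direct than detecting.
var_dump(mb_check_encoding($valid, 'UTF-8'));    // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8'));  // bool(false)
```

Neither call says anything about a BOM: a string that starts with the UTF-8 BOM is still valid UTF-8.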

You can filter the string input and remove a UTF-8 bom with a small helper function:

/**
 * remove UTF-8 BOM if string has it at the beginning
 *
 * @param string $str
 * @return string
 */
function remove_utf8_bom($str)
{
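   // Compare directly: "$bytes = substr(...) && ..." would assign a boolean,
   // because && binds more tightly than =, so the BOM was never detected.
   if (substr($str, 0, 3) === "\xEF\xBB\xBF")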
   {
       $str = substr($str, 3);
   }
   return $str;
}

Usage: