MongoDB PHP UTF-8 problem

Posted 2024-11-05 19:02:16

Assume that I need to insert the following document:

{
    title: 'Péter'
}

(note the é)

It gives me an error when I use the following PHP-code ... :

$db->collection->insert(array("title" => "Péter"));

... because it needs to be utf-8.

So I should use this line of code:

$db->collection->insert(array("title" => utf8_encode("Péter")));

Now, when I request the document, I still have to decode it ... :

$document = $db->collection->findOne(array("_id" => new MongoId("__someID__")));
$title = utf8_decode($document['title']);

Is there some way to automate this process? Can I change the character encoding of MongoDB? (I'm migrating a MySQL database that uses cp1252 West Europe (latin1).)

I already considered changing the Content-Type header; the problem is that all static (hardcoded) strings aren't UTF-8...

Thanks in advance!
Tim

Comments (3)

清风挽心 2024-11-12 19:02:16

JSON and BSON can only encode / decode valid UTF-8 strings. If your data (including input) is not UTF-8, you need to convert it before passing it to any JSON-dependent system, like this:

$string = iconv('UTF-8', 'UTF-8//IGNORE', $string); // or
$string = iconv('UTF-8', 'UTF-8//TRANSLIT', $string); // or even
$string = iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $string); // not sure how this behaves

Personally I prefer the first option; see the iconv() manual page.

You should always make sure your strings are UTF-8 encoded, even the user-submitted ones. However, since you mentioned that you're migrating from MySQL to MongoDB, have you tried exporting your current database to CSV and using the import scripts that come with Mongo? They should handle this...
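
A minimal sketch of the "make sure your strings are UTF-8" step, under a few assumptions: the mbstring extension is loaded, any input that is not already valid UTF-8 is ISO-8859-1 (cp1252 data would need its own encoding name), and ensureUtf8() is a hypothetical helper name, not part of PHP or the MongoDB driver. The sample strings are written as byte escapes so the result does not depend on how this file itself is saved.

```php
<?php
// Hypothetical helper: not part of PHP or the MongoDB driver.
// Assumes mbstring is loaded and that non-UTF-8 input is ISO-8859-1.
function ensureUtf8($value, $assumedLegacy = 'ISO-8859-1')
{
    // Already valid UTF-8: return unchanged to avoid double encoding.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Otherwise treat the bytes as the assumed legacy encoding.
    return mb_convert_encoding($value, 'UTF-8', $assumedLegacy);
}

$latin1 = "P\xE9ter";      // "Péter" as ISO-8859-1 bytes
$utf8   = "P\xC3\xA9ter";  // "Péter" as UTF-8 bytes

var_dump(ensureUtf8($latin1) === $utf8); // bool(true)
var_dump(ensureUtf8($utf8) === $utf8);   // bool(true)
```

Checking before converting means a string that is already UTF-8 is not encoded twice. The check is only a heuristic, though: a legacy byte sequence can occasionally happen to be valid UTF-8, so knowing the real source encoding is still the safer route.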


EDIT: I said that BSON can only handle UTF-8, but I'm not sure that is exactly true. I have a vague idea that BSON uses UTF-16 or UTF-32 to encode / decode data, but I can't check right now.

日记撕了你也走了 2024-11-12 19:02:16

As @gates said, all string data in BSON is encoded as UTF-8. MongoDB assumes this.

Another key point that neither answer addresses: PHP is not Unicode-aware, at least as of 5.3. PHP 6 will supposedly be Unicode-aware. This means you have to know which encoding your operating system uses by default and which encoding PHP is using.

Let's get back to your original question: "Is there some way to automate this process?" ... my suggestion is to make sure you are always using UTF-8 throughout your application. Configuration, input, data storage, presentation, everything. Then the "automated" part is that most of your PHP code will be simpler since it always assumes UTF-8. No conversions necessary. Heck, nobody said automation was cheap. :)

As an aside: if you created a little PHP script to test that insert() code, figure out what encoding your file is saved in, then convert to UTF-8 before inserting. For example, if you know the file is ISO-8859-1, try this:

$title = mb_convert_encoding("Péter", "UTF-8", "ISO-8859-1");
$db->collection->insert(array("title" => $title));
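
For a migration, the same conversion could be applied to a whole row before it is inserted. The following is only a sketch: convertDocumentToUtf8() is a hypothetical helper, it assumes mbstring is loaded and that every string value in the row really is in the stated source encoding, and it leaves array keys untouched.

```php
<?php
// Hypothetical migration helper; the name and signature are illustrative.
// Assumes mbstring is loaded and every string value in $doc is encoded in $from.
function convertDocumentToUtf8(array $doc, $from = 'ISO-8859-1')
{
    foreach ($doc as $key => $value) {
        if (is_string($value)) {
            // Convert string values from the legacy encoding to UTF-8.
            $doc[$key] = mb_convert_encoding($value, 'UTF-8', $from);
        } elseif (is_array($value)) {
            // Recurse into nested arrays (embedded documents, lists).
            $doc[$key] = convertDocumentToUtf8($value, $from);
        }
        // Numbers, booleans, null and objects are left unchanged.
    }
    return $doc;
}

$row = array(
    'title' => "P\xE9ter",                 // "Péter" as ISO-8859-1 bytes
    'tags'  => array("caf\xE9", 'plain'),
    'views' => 3,
);
$doc = convertDocumentToUtf8($row);
// $db->collection->insert($doc); // as in the question; needs a live connection
```

Converting once at the boundary (when rows leave MySQL) keeps the rest of the application free of per-field encode/decode calls, which is the "automation" described above.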

只涨不跌 2024-11-12 19:02:16

Can I change the character-encoding of MongoDB...

No. Data is stored in BSON, and according to the BSON spec, all strings are UTF-8.

Now, when I request the document, I still have to decode it ... :
Is there some way to automate this process?

It sounds like you are trying to output the data to web page. Needing to "decode" text that was already encoded seems incorrect.

Could this output problem be a configuration issue with Apache+PHP? UTF-8 support in PHP is not automatic; a quick online search brings up several tutorials on this topic.
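
One possible shape of that configuration on the PHP side, as a sketch rather than a definitive setup: it assumes the page is generated by PHP, that nothing has been sent to the browser yet, and that the stored value is UTF-8. On the Apache side the AddDefaultCharset directive plays a similar role. Hardcoded strings in the source files would still have to be saved as UTF-8 for the page to be consistent.

```php
<?php
// Assumption: no output has been sent yet, so the header can still be set.
ini_set('default_charset', 'UTF-8');               // charset PHP advertises by default
header('Content-Type: text/html; charset=UTF-8');  // explicit header for this response

// With UTF-8 stored in MongoDB and declared to the browser,
// the value can be echoed without utf8_decode().
$title = "P\xC3\xA9ter"; // stands in for $document['title']
echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
```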
