PostgreSQL + PHP + UTF8 = invalid byte sequence for encoding

Posted on 2024-08-09 13:09:08

I'm migrating a database from MySQL to PostgreSQL. The MySQL database's default collation is UTF8, PostgreSQL is also using UTF8, and I'm escaping the data with pg_escape_string(). For whatever reason, however, I'm running into some funky errors about bad encoding:

pg_query() [function.pg-query]: Query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xeb7374
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

I've been poking around trying to figure this out, and noticed that PHP is doing something weird: if a string has only ASCII chars in it (e.g. "hello"), the detected encoding is ASCII. If the string contains any non-ASCII chars, it says the encoding is UTF8 (e.g. "Hëllo").

When I use utf8_encode() on strings that are already UTF8, it kills the special chars and makes them all messed up, so... what can I do to get this to work?

(The exact char hanging it up right now is "�", but instead of just doing a search/replace, I'd like to find a better solution so this kind of problem doesn't happen again.)
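The bytes quoted in the error message are a useful clue on their own. The following is a minimal sketch, written in Python purely for illustration (the variable names are hypothetical and not part of the original code), showing that 0xeb7374 is not valid UTF-8 but decodes cleanly as Latin-1, and that converting Latin-1 to UTF-8 on text that is already UTF-8, which is what PHP's utf8_encode() does, double-encodes it:

```python
# Bytes reported by PostgreSQL in the error message: 0xeb 0x73 0x74
raw = bytes.fromhex("eb7374")

# In UTF-8, 0xEB announces a three-byte sequence, so the next two bytes
# must be continuation bytes (0x80-0xBF). 0x73 ("s") is not one.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The same bytes are well-formed Latin-1 (ISO-8859-1) text.
as_latin1 = raw.decode("latin-1")

# utf8_encode() assumes Latin-1 input and converts it to UTF-8.
# Feeding it bytes that are already UTF-8 encodes them a second time.
already_utf8 = "H\u00ebllo".encode("utf-8")          # "Hëllo" as UTF-8
double_encoded = already_utf8.decode("latin-1").encode("utf-8")

print(valid_utf8)                       # False
print(as_latin1)                        # ëst
print(double_encoded.decode("utf-8"))   # HÃ«llo
```

This suggests the failing value arrived as Latin-1 (or a similar single-byte encoding) rather than UTF-8, so the fix belongs where the data is read, not in a blanket utf8_encode() call.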

Comments (2)

忘你却要生生世世 2024-08-16 13:09:08

Most likely, the data in your MySQL database isn't actually UTF8. It's a pretty common scenario. MySQL, at least in the past, did no proper validation of the data, so it accepted anything you threw at it as UTF8 as long as your client claimed it was UTF8. They may have fixed that by now (or not; I don't know if they even consider it a problem), but you may already have incorrectly encoded data in the db. PostgreSQL, of course, performs full validation when you load the data, and so the load may fail.

You may want to feed the data through something like iconv that can be set to ignore unknown characters, or transform them to "best guess".
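As a rough sketch of that idea, here is a Python illustration rather than an iconv invocation. The function name `to_valid_utf8` and the choice of cp1252 as the fallback encoding are assumptions made for the example; the right fallback depends on what actually wrote the data. Already-valid UTF-8 is passed through untouched, invalid input is reinterpreted as the assumed legacy encoding (the "best guess"), and as a last resort the offending bytes are dropped:

```python
def to_valid_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return UTF-8 bytes for `raw`, guessing a legacy encoding if needed."""
    # 1. Already valid UTF-8: leave it alone (avoids double encoding).
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        pass

    # 2. Best guess: treat the bytes as the assumed legacy encoding.
    try:
        return raw.decode(fallback).encode("utf-8")
    except UnicodeDecodeError:
        pass

    # 3. Last resort: silently drop whatever cannot be decoded.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


print(to_valid_utf8(bytes.fromhex("eb7374")))  # b'\xc3\xabst'
```

Checking for valid UTF-8 first matters: converting unconditionally would mangle the rows that were stored correctly. Step 3 loses data, so it is worth logging which rows reach it.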

尝蛊 2024-08-16 13:09:08

BTW, an ASCII string is exactly the same in UTF-8, because UTF-8 encodes the 128 ASCII characters (code points 0 to 127) as the same single bytes; so "Hello" in ASCII is exactly the same as "Hello" in UTF-8, and no conversion is needed.

The collation of the table may be UTF-8, but you may not be fetching information from it in the same encoding. If you have trouble with the data you give to pg_escape_string, it's probably because you're assuming content fetched from MySQL is encoded in UTF-8 when it isn't. I suggest you look at the MySQL documentation on connection character sets and check the encoding of your connection; you're probably fetching from a table where the collation is UTF-8 but your connection is something like Latin-1 (where special characters such as çéèêöà etc. won't be encoded in UTF-8).
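To make the mismatch concrete, here is a small Python simulation. It does not talk to MySQL; the two `via_*` variables simply stand in for what a client would receive over a Latin-1 connection versus a UTF-8 connection, and the sample word and helper name are made up for the example:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if `raw` decodes as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "caf\u00e9"  # "café", stored correctly in the table

# Simulated wire bytes for the same stored value under two
# different connection character sets.
via_latin1_connection = text.encode("latin-1")
via_utf8_connection = text.encode("utf-8")

print(via_latin1_connection.hex())            # 636166e9
print(via_utf8_connection.hex())              # 636166c3a9
print(is_valid_utf8(via_latin1_connection))   # False
print(is_valid_utf8(via_utf8_connection))     # True

# Pure ASCII text is byte-for-byte identical in both encodings.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True
```

The stored value is the same in both cases; only the connection character set changes the bytes handed to PHP. Setting the MySQL connection to UTF-8 before reading (for example with `SET NAMES utf8`) is the usual way to make the fetched bytes match what PostgreSQL expects.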
