Decoding Java's JSON Unicode values with PHP
In the past I have seen different JSON-encoded values for the same string depending on the language used. Since the APIs were used in a closed environment (no third parties allowed), we made a compromise, and all our Java applications now encode Unicode characters manually. LinkedIn's API is returning "corrupted" values, basically the same as our Java applications. I've already posted a question on their forum; the reason I am asking here as well is quite simple: sharing is caring :) This question is therefore partially connected with LinkedIn, but it is mostly an attempt to find an answer to the general encoding problem described below.
As you can see, my last name contains the letter ž, which should be \u017e, but Java (or LinkedIn's API for that matter) returns \u009e in the JSON response and nothing in the XML response. PHP's json_decode() ignores it and my last name becomes Kurida.
After some investigation, I've found that ž apparently has two representations, 9e and 17e. What exactly is going on here? Is there a solution for this problem?
Comments (1)
U+009E is a usually-invisible control character and not an acceptable alternative representation for ž. The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.

(The confusion comes from the fact that if you write &#x9E; in an HTML page, the browser doesn't actually give you character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references 0080–009F: they get changed as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it, so we're stuck with it now. The exception is proper XHTML served as XML, since that has to follow the more sensible XML rules.)

Looking at the forum page, the JSON reading is clearly not wrong: your name is registered as being “David Kurid[U+009E]a”. However that data got into their system, that is the part that needs looking at.
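
To make the byte-versus-code-point distinction concrete, here is a minimal Java sketch. It decodes the single byte 0x9E under both charsets, then applies a consumer-side workaround that remaps stray C1 control characters (U+0080 to U+009F) to the characters the same byte values denote in windows-1252. The class name Cp1252Repair and the method repair are hypothetical names made up for this illustration, and the remapping assumes the data really was mangled by an ISO-8859-1/cp1252 mix-up. It does not fix the source of the bad data; the same remapping idea could be applied to the strings returned by json_decode() on the PHP side.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any library.
class Cp1252Repair {
    // Assumption: the JVM ships the windows-1252 charset (standard JDK builds do).
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Replaces each C1 control character (U+0080..U+009F) with the character
    // that the same byte value denotes in windows-1252.
    static String repair(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                String mapped = new String(new byte[] {(byte) c}, CP1252);
                // Byte values that windows-1252 leaves undefined may decode to
                // the replacement character; keep the original in that case.
                boolean usable = mapped.length() == 1 && mapped.charAt(0) != '\uFFFD';
                out.append(usable ? mapped.charAt(0) : c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x9E};
        // The same byte yields different code points depending on the charset.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(raw, CP1252);
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0));

        // Workaround applied to the mangled surname from the question.
        System.out.println(repair("Kurid\u009ea"));
    }
}
```

The remapping is only safe when genuine C1 control characters are not expected in the data, which is usually the case for personal names.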