Decoding Java's JSON Unicode values with PHP

Posted on 2024-09-29 22:36:35

In the past I have seen the same string produce different JSON-encoded values depending on the language used. Since the APIs were used in a closed environment (no third parties allowed), we compromised, and all our Java applications encode Unicode characters manually. LinkedIn's API returns "corrupted" values, basically the same as our Java applications. I've already posted a question on their forum; the reason I'm asking here as well is quite simple: sharing is caring :) This question is therefore partly connected with LinkedIn, but it is mostly an attempt to find an answer to the general encoding problem described below.

As you can see, my last name contains the letter ž, which should be \u017e, but Java (or LinkedIn's API, for that matter) returns \u009e in the JSON response and nothing in the XML response. PHP's json_decode() ignores it, and my last name becomes Kurida.

After some investigation, I found that ž apparently has two representations, 9e and 17e. What exactly is going on here? Is there a solution to this problem?


Comments (1)

笑叹一世浮沉 2024-10-06 22:36:35

U+009E is a usually-invisible control character and not an acceptable alternative representation for ž.

The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.
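The difference between the two decodings can be reproduced in a few lines. This is only an illustrative sketch in Python (not the code either system actually runs); the variable names are made up for the example.

```python
# The same single byte, decoded under two different assumptions.
raw = bytes([0x9E])

# Windows code page 1252 maps 0x9E to ž (U+017E).
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x9E becomes the C1 control character U+009E.
as_latin1 = raw.decode("latin-1")

print(f"cp1252:     U+{ord(as_cp1252):04X}")
print(f"ISO-8859-1: U+{ord(as_latin1):04X}")
```

If a producer holds cp1252 bytes but labels or treats them as ISO-8859-1, this is exactly how ž would end up serialized as \u009e.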

(The confusion comes from the fact that if you write the character reference &#158; in an HTML page, the browser doesn't actually give you character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references in the range 0080–009F: they are treated as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it, so we're stuck with it now. The exception is proper XHTML served as XML, since that has to follow the more sensible XML rules.)
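The browser behaviour described here can be observed outside a browser too. As a small sketch, Python's standard html.unescape follows the HTML5 rules for numeric character references, so it applies the same cp1252 remapping:

```python
import html

# Numeric character references in the 0x80-0x9F range are remapped
# as if they were cp1252 bytes, mirroring what browsers do.
decoded_decimal = html.unescape("&#158;")
decoded_hex = html.unescape("&#x9E;")

print(f"U+{ord(decoded_decimal):04X}")  # U+017E, not U+009E
print(f"U+{ord(decoded_hex):04X}")      # U+017E, not U+009E
```

JSON has no such remapping: \u009e in a JSON string always means U+009E, which is why a JSON parser returning that control character is behaving correctly.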

Looking at the forum page, the JSON reading is clearly not wrong: your name is registered as being “David Kurid[U+009E]a”. How that data got into their system, however, needs looking at.
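Until the data is fixed at the source, a consumer could patch it up after decoding. The following is a workaround sketch only, shown in Python for illustration; it assumes that any C1 control character (U+0080 to U+009F) in the decoded text is really a cp1252 byte that was misread as ISO-8859-1. The function name repair_c1 is hypothetical.

```python
import json

def repair_c1(text: str) -> str:
    """Reinterpret C1 control characters as cp1252 bytes (assumption:
    they came from cp1252 data that was mis-decoded as ISO-8859-1)."""
    repaired = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (e.g. 0x81, 0x9D) are unassigned in cp1252;
                # leave those characters untouched.
                pass
        repaired.append(ch)
    return "".join(repaired)

# JSON as described in the question: the surname carries \u009e.
decoded = json.loads('"Kurid\\u009ea"')
fixed = repair_c1(decoded)
print(fixed)
```

The same idea carries over to PHP after json_decode(), but it only masks the symptom. The real fix is for the producer to decode its input with the correct character set.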
