GSON / JSON: weird special-character (umlaut) problem

Posted 2024-12-11 13:37:20 · 711 words · 2 views · 0 comments

While trying to process a JSON response with GSON (the output is from the Flickr API, in case you're wondering), I encountered what I'd describe as a pretty weird encoding of certain special characters:

Original JSON response

Here's a hex view of it:

Hex View of Original JSON response

The 'u' followed by the 'double-dots' is what's supposed to be a German 'ü', and this is where my confusion starts. It's as if someone took the char and ripped it in half, encoding each of the 2 pieces. The following image shows the hex encoding of what I'd expect it to be in case the 'ü' was correctly encoded:

Expected Hex View

Even weirder: in cases where I would expect problems to occur (namely, Asian scripts), everything seems to work fine, e.g. "title": "ナガレテユク・・・"

Questions:

  1. Is that some Flickr API oddity, or is it correct JSON encoding for the response? Or is the JSON correctly encoded and it's GSON that's failing to 're-assemble' this response into the original 'ü'? Or did the author of the title simply mess it up on their end?
  2. How do I solve the problem (in case it's either JSON or GSON that's messing around; obviously I can't do anything if it was the author)? How do I know which 'other' characters are affected (ö and ä come to mind, but there are probably more 'special cases')?

始于初秋 2024-12-18 13:37:20

What you're seeing there is a case of Unicode decomposition:

Characters like German umlauts can be expressed in two ways:

  • the more traditional precomposed form as a single character ü or
  • in decomposed form as the base character u followed by a combining diaeresis ̈_ (U+0308; I had to use an underscore here to make it show up, because it's not supposed to stand alone: it's really just the two "hovering dots")
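
To connect the two forms with the hex views in the question, here is a minimal sketch that prints the UTF-8 bytes of each form. The class name UmlautBytes and the helper utf8Hex are made up for illustration, and the sketch assumes the response is UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    // Hypothetical helper: returns the UTF-8 bytes of s as space-separated hex pairs.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00FC"; // U+00FC LATIN SMALL LETTER U WITH DIAERESIS
        String decomposed = "u\u0308"; // 'u' followed by U+0308 COMBINING DIAERESIS
        System.out.println(utf8Hex(precomposed)); // C3 BC
        System.out.println(utf8Hex(decomposed));  // 75 CC 88
    }
}
```

Both byte sequences are valid UTF-8 and valid inside a JSON string, so a decomposed sequence in the raw response would not by itself indicate an encoding bug in the API or in the parser.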

If you receive something like this, it's easily converted into precomposed form by using java.text.Normalizer (available since Java 1.6):

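import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Arrays;

public class NormalizeDemo {
  public static void main(String[] args) {
    // Decomposed input: 'u' followed by U+0308 COMBINING DIAERESIS
    String decomposed = "Mitgefu\u0308hl";
    printChars(decomposed); // Mitgefühl -- [M, i, t, g, e, f, u, ̈, h, l]
    // NFC composes the base character and the combining mark into a single ü
    String precomposed = Normalizer.normalize(decomposed, Form.NFC);
    printChars(precomposed); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]

    // Normalizing with NFC again doesn't hurt:
    String precomposedAgain = Normalizer.normalize(precomposed, Form.NFC);
    printChars(precomposedAgain); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]
  }

  // Prints the string, then its individual chars so the two forms can be told apart
  static void printChars(String s) {
    System.out.println(s + " -- " + Arrays.toString(s.toCharArray()));
  }
}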

As you can see, applying NFC to an already precomposed string doesn't hurt.
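
As for which "other" characters are affected: in principle any character that has a canonical decomposition (ä, ö, é, ñ and many more), so enumerating them by hand is not practical. A minimal sketch of a helper that normalizes every incoming string instead; the class name NfcCheck and the method toNfc are hypothetical, and the sample inputs are made up for illustration.

```java
import java.text.Normalizer;

public class NfcCheck {
    // Hypothetical helper: returns s in NFC, skipping the work if s is already normalized.
    public static String toNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC)
                ? s
                : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed samples: a base letter followed by a combining mark
        String[] samples = { "a\u0308", "o\u0308", "u\u0308", "e\u0301", "n\u0303" };
        for (String s : samples) {
            String nfc = toNfc(s);
            System.out.println(s + ": " + s.length() + " chars in, "
                    + nfc.length() + " out, equals input: " + s.equals(nfc));
        }
    }
}
```

Note that String.equals compares char by char, so a decomposed and a precomposed "ü" are not equal until both sides have been normalized to the same form. Applying a helper like this to the strings that come out of the parser would be one way to get consistent comparisons.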

Note that printing the String will look correct on any Unicode-capable terminal; you only see the difference between the decomposed and precomposed forms if you print the character array.
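
If the terminal or font makes even the character array hard to read, dumping the code points numerically removes the ambiguity. This is a small sketch that assumes Java 8 or later (for String.codePoints()); the class name CodePointDump and the method dump are made up for illustration.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: formats each code point of s as U+XXXX, independent of rendering.
    public static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(dump("u\u0308")); // U+0075 U+0308
        System.out.println(dump("\u00FC"));  // U+00FC
    }
}
```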

A possible source might be macOS, which tends to encode things in decomposed form. It's curious that Flickr doesn't normalize this stuff, though.
