
GSON / JSON: strange problem with special characters (umlauts)

Posted 2024-12-11 13:37:20

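While processing a JSON response with GSON (the output comes from the flickr API, in case you ask), I ran into what I can only describe as a very strange encoding of certain special characters: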

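[Image: raw JSON response]

Here is its hex view:

[Image: hex view of the raw JSON response]

The "u" followed by the "double dots" should be the German "ü", and this is where my confusion starts. It is as if someone tore the character in half and encoded the two parts separately. The next image shows the hex encoding I would expect if the "ü" were encoded properly:

[Image: expected hex view]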

Even stranger, in cases where I would expect problems (i.e. Asian character sets), everything seems to work fine, e.g. "title":"ナガレテユク・・・"

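Questions:

Is this some quirk of the flickr API, or is it the correct JSON encoding of the response? Or is the JSON encoded correctly, and GSON fails to "reassemble" the response into the original "ü"? Or did the author of the title information simply mess up?
How do I solve this (if JSON or GSON is the culprit; if it is the author, there is obviously nothing I can do)? And how do I find out which "other" characters are affected (ö and ä come to mind, but there may be more "special cases")?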


始于初秋 2024-12-18 13:37:20

What you're seeing there is a case of Unicode decomposition:

Characters like German umlauts can be expressed in two ways:

the more traditional precomposed form as a single character ü (U+00FC), or
in decomposed form as the base character u followed by a combining diaeresis ̈_ (U+0308; I had to use an underscore here to make it show up because it's not supposed to stand alone, it's really just the two "hovering dots")
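To connect the two forms above with a hex view like the one in the question, here is a minimal sketch that prints the UTF-8 bytes of each form. The class name `UmlautForms` and the helper `utf8Hex` are made up for this illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public class UmlautForms {
    // Hypothetical helper: renders the UTF-8 encoding of a string as space-separated hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00FC"; // single code point U+00FC
        String decomposed = "u\u0308"; // base letter u plus combining diaeresis U+0308

        System.out.println(utf8Hex(precomposed)); // C3 BC
        System.out.println(utf8Hex(decomposed));  // 75 CC 88

        // The two strings render identically but are not equal as Java strings.
        System.out.println(precomposed.equals(decomposed)); // false
    }
}
```

Seeing a plain `75` (the letter u) followed by `CC 88` in a hex dump is therefore a sign of the decomposed form rather than of a broken encoding.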

If you receive something like this, it's easily converted into precomposed form by using java.text.Normalizer (available since Java 1.6):

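import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Arrays;

public class NormalizeDemo {
  public static void main(String[] args) {
    // Decomposed input: 'u' followed by U+0308 (combining diaeresis)
    String decomposed = "Mitgefu\u0308hl";
    printChars(decomposed); // Mitgefühl -- [M, i, t, g, e, f, u, ̈, h, l]
    String precomposed = Normalizer.normalize(decomposed, Form.NFC);
    printChars(precomposed); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]

    // Normalizing with NFC again doesn't hurt:
    String precomposedAgain = Normalizer.normalize(precomposed, Form.NFC);
    printChars(precomposedAgain); // Mitgefühl -- [M, i, t, g, e, f, ü, h, l]
  }

  // Prints the string followed by its individual chars
  static void printChars(String s) {
    System.out.println(s + " -- " + Arrays.toString(s.toCharArray()));
  }
}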

As you can see, applying NFC to an already precomposed string doesn't hurt.
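As for which "other" characters are affected: any character that has a canonical decomposition can arrive this way, not only ü, ö and ä. Rather than listing them, one option is to normalize every string you read. The sketch below assumes you apply it to the string fields after parsing; the class name `NfcHelper` and the method `toNfc` are hypothetical names for this illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NfcHelper {
    // Hypothetical helper: returns the NFC form of the input.
    // Strings that are already in NFC are returned unchanged.
    static String toNfc(String s) {
        return Normalizer.isNormalized(s, Form.NFC) ? s : Normalizer.normalize(s, Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed a, o, u with diaeresis, e with acute accent, n with tilde
        String[] decomposedSamples = { "a\u0308", "o\u0308", "u\u0308", "e\u0301", "n\u0303" };
        for (String sample : decomposedSamples) {
            String composed = toNfc(sample);
            // Each sample shrinks from two chars to one precomposed char.
            System.out.println(sample.length() + " -> " + composed.length()); // 2 -> 1
        }
    }
}
```

The `isNormalized` check is only a small shortcut; calling `Normalizer.normalize` unconditionally gives the same result.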

Note that printing the String will look correct on any Unicode-capable terminal; only when you print the character array do you see the difference between the decomposed and precomposed forms.

A possible source might be macOS, which tends to encode things in decomposed form. It's curious that Flickr doesn't normalize this stuff, though.
