String encoding Java

Java String.codePointAt returns unexpected value

Posted on 2024-09-27 21:45:02

If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:

String s1 = new String("#");
int val = s1.codePointAt(0);

This returns 35 which is the correct value.

But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:

String s1 = new String("ƒ");  // Latin small letter f with hook
int val = s1.codePointAt(0);

This should return 159 as per this reference table, but instead returns 402. Why is this?

羅雙樹 2024-10-04 21:45:02

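"But if I try use ASCII characters from 128 to 255"

ASCII doesn't have values in this range. It uses only 7 bits.

Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.

The codePointAt method returns the 32-bit code point. A 16-bit char can't hold the entire Unicode range, so some code points must be split across two chars (as per the UTF-16 encoding scheme). The codePointAt method helps resolve chars to code points.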
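The split across two chars can be seen with a short sketch. The class name SurrogateDemo and the sample code point U+1F600 are chosen here purely for illustration and do not come from the answer.

```java
// Illustrative only: the class name and the sample character are hypothetical choices.
class SurrogateDemo {
    // U+1F600 lies outside the 16-bit range, so UTF-16 stores it as a surrogate pair.
    static final String SUPPLEMENTARY = "\uD83D\uDE00";

    public static void main(String[] args) {
        String s = SUPPLEMENTARY;
        System.out.println(s.length());                            // number of 16-bit chars
        System.out.println(s.codePointCount(0, s.length()));       // number of code points
        System.out.println(Integer.toHexString(s.charAt(0)));      // high surrogate only
        System.out.println(Integer.toHexString(s.codePointAt(0))); // full code point
        // A character such as U+0192 fits in one char, so both views agree.
        System.out.println("\u0192".codePointAt(0) == "\u0192".charAt(0));
    }
}
```

For characters that fit in a single char, such as ƒ, charAt and codePointAt report the same number, which is why the question only sees a difference relative to the legacy code page table.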

I wrote a rough guide to encoding in Java here.

笑咖 2024-10-04 21:45:02

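Java chars are not encoded in ISO-8859-1. They use UTF-16, which has the same values as 7-bit ASCII for those characters (values 0-127 only).

To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with ~~String.getBytes("ISO-8859-1");~~ and look in the byte array.

Update

ISO-8859-1 is not the extended ASCII encoding; use String.getBytes("Cp437"); to get the correct values.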
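A minimal sketch of that byte-array approach follows. The class LegacyByteDemo and its helper firstByte are hypothetical names introduced for illustration, and the sketch assumes the JVM may or may not ship the optional Cp437 charset, so it checks before using it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: the class and method names are hypothetical.
class LegacyByteDemo {
    // Encodes the string and returns its first byte as an unsigned value (0-255).
    static int firstByte(String s, Charset cs) {
        return s.getBytes(cs)[0] & 0xFF; // Java bytes are signed, so mask to get 0..255
    }

    public static void main(String[] args) {
        String s = "\u0192"; // ƒ, LATIN SMALL LETTER F WITH HOOK
        // ISO-8859-1 has no ƒ, so getBytes substitutes the charset's replacement byte.
        System.out.println("ISO-8859-1: " + firstByte(s, StandardCharsets.ISO_8859_1));
        // Cp437 is an optional charset, so check that this JVM provides it.
        if (Charset.isSupported("Cp437")) {
            System.out.println("Cp437: " + firstByte(s, Charset.forName("Cp437")));
        }
    }
}
```

The mask with 0xFF matters because a raw Java byte above 127 prints as a negative number, which would not match the values in a code page table.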

自由如风 2024-10-04 21:45:02

in Unicode

ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
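As a rough way to check this entry from Java itself, the sketch below formats a code point in hex and decimal together with its Unicode name. The class CodePointNameDemo and the method describe are hypothetical names used only for this illustration.

```java
// Illustrative only: the class and method names are hypothetical.
class CodePointNameDemo {
    // Formats a code point as "U+XXXX (decimal) NAME".
    static String describe(int codePoint) {
        // Character.getName returns the Unicode character name, or null if unassigned.
        return String.format("U+%04X", codePoint)
                + " (" + codePoint + ") "
                + Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(describe("\u0192".codePointAt(0)));
    }
}
```

Hex 0x0192 and decimal 402 are the same number, so the value returned by codePointAt matches the Unicode table entry.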

岁月静好 2024-10-04 21:45:02

String.codePointAt returns the Unicode code point at the specified index.

The Unicode code point of ƒ is 402 (U+0192), see

http://www.decodeunicode.org/de/u+0192/properties

So

System.out.println("ƒ".codePointAt(0));

printing 402 is correct.

If you are interested in the representation in other charsets, you can print the byte representation of the character in each charset via String.getBytes:

    import java.nio.charset.Charset;
    import java.util.Arrays;

    final String s = "ƒ";
    // Print the encoded bytes of s for every installed charset that can encode it.
    for (final Charset cs : Charset.availableCharsets().values()) {
        // Some charsets are decode-only; newEncoder() would throw for them.
        if (cs.canEncode() && cs.newEncoder().canEncode(s)) {
            // Bytes print as signed values, so 0x9F shows up as -97.
            System.out.println(cs.name() + ": " + Arrays.toString(s.getBytes(cs)));
        }
    }
