Converting UTF-8 to ISO-8859-1 in Java

Posted 2024-07-30 13:11:19

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, a few characters are not displayed correctly, such as typographic ("smart") quotation marks (they display as ?).

Is it possible to convert these characters from UTF-8 to ISO-8859-1?

Here is a snippet of code I have written to attempt this:

BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();

String line = null;
while ((line = br.readLine()) != null) {
  sb.append(line);
}
br.close();

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

return new String(latin1);

I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with

byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");

I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.

Comments (4)

丶视觉 2024-08-06 13:11:19

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences (numeric character references) as shown here:

public final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

Example usage:

String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK (U+201C) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.

Care needs to be taken with this approach. If your text also needs to be escaped for HTML, do that before running the code above; otherwise the ampersands in the generated entities end up being escaped as well.
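One way to sidestep the ordering problem is to do both escapes in a single pass, so an entity that has just been written is never looked at again. The following is only a rough sketch of that idea; the class name HtmlTextEscaper and its escape method are made up for this illustration and are not part of the answer's code or of any library:

```java
final class HtmlTextEscaper {
    private HtmlTextEscaper() {}

    /** Escapes HTML metacharacters and every non-ASCII code point in one pass. */
    static String escape(CharSequence text) {
        StringBuilder out = new StringBuilder();
        // codePoints() yields whole code points, so surrogate pairs arrive as one int.
        text.codePoints().forEach(cp -> {
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp < 0x80) {
                        out.append((char) cp); // plain ASCII passes through
                    } else {
                        // numeric character reference, same form as the code above
                        out.append("&#x").append(Integer.toHexString(cp)).append(';');
                    }
            }
        });
        return out.toString();
    }
}
```

With this sketch, HtmlTextEscaper.escape("A & B \u201C<b>\u201D") gives A &amp;amp; B &amp;#x201c;&amp;lt;b&amp;gt;&amp;#x201d; as page source, that is: A &amp; B &#x201c;&lt;b&gt;&#x201d;.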

输什么也不输骨气 2024-08-06 13:11:19

Depending on your default encoding, the following lines could cause a problem:

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

return new String(latin1);

In Java, a String (or char) always holds UTF-16 code units. A different encoding is involved only when you convert characters to bytes or bytes back to characters. Say your default encoding is UTF-8: new String(latin1) then decodes the latin1 buffer as UTF-8, and because some Latin-1 byte sequences are not valid UTF-8, you will get ? for those characters.
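To make the two failure modes concrete, here is a small sketch (the class and method names are invented for the demonstration). It assumes nothing beyond java.nio.charset.StandardCharsets: the first method keeps both directions on ISO-8859-1, the second mimics what new String(latin1) does on a platform whose default charset is UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTripDemo {
    private Latin1RoundTripDemo() {}

    /** Encodes to ISO-8859-1 and decodes with the same charset; no platform default involved. */
    static String roundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    /** Encodes to ISO-8859-1 but decodes as UTF-8, as new String(bytes) would with a UTF-8 default. */
    static String decodeAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }
}
```

roundTrip leaves "caf\u00E9" intact, but turns "\u201Chi\u201D" into "?hi?": the smart quotes have no Latin-1 byte, so getBytes already substitutes ? at encoding time. decodeAsUtf8("caf\u00E9") loses the é as well, because the single byte 0xE9 is not a complete UTF-8 sequence and is decoded as the replacement character U+FFFD.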

小情绪 2024-08-06 13:11:19

With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):

import java.util.PrimitiveIterator;

public final class HtmlEncoder {
    private HtmlEncoder() {
    }

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
                                                          T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}
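
Both versions above escape everything outside Basic Latin, so characters that ISO-8859-1 can represent directly (é, ü and so on) are turned into entities too. If that matters, a possible variant asks a CharsetEncoder whether each character is encodable and only escapes the rest. This is a sketch under that assumption; Latin1HtmlEncoder and escapeNonLatin1 are illustrative names, and like the original it does not escape & or <:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Latin1HtmlEncoder {
    private Latin1HtmlEncoder() {}

    /** Keeps characters ISO-8859-1 can encode; replaces the rest with numeric character references. */
    static String escapeNonLatin1(CharSequence text) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            // Supplementary code points can never be Latin-1, so only BMP chars are tested.
            if (Character.isBmpCodePoint(cp) && latin1.canEncode((char) cp)) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }
}
```

For example, escapeNonLatin1("caf\u00E9 \u201Cx\u201D") keeps the é as a literal character and only rewrites the two quotation marks as &#x201c; and &#x201d;.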
我不是你的备胎 2024-08-06 13:11:19

When you instantiate your String object, you need to indicate which encoding to use.

So replace:

return new String(latin1);

with

return new String(latin1, "ISO-8859-1");
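
Put differently, only bytes have an encoding; a String does not. Converting UTF-8 data to ISO-8859-1 is therefore a decode step followed by an encode step, each with its charset named explicitly. A minimal sketch (Transcoder and utf8ToLatin1 are hypothetical names used only here):

```java
import java.nio.charset.StandardCharsets;

final class Transcoder {
    private Transcoder() {}

    /** Decodes UTF-8 bytes to a String, then encodes that String as ISO-8859-1 bytes. */
    static byte[] utf8ToLatin1(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        // Characters with no Latin-1 equivalent are replaced by '?' at this point.
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The two UTF-8 bytes 0xC3 0xA9 (é) come out as the single Latin-1 byte 0xE9. Characters such as smart quotes still become ?, since ISO-8859-1 has no byte for them, which is why the entity-based approach is needed for those.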