Encoding conversion in Java

Posted on 2024-07-08 19:28:10

Is there any free Java library that I can use to convert a string from one encoding to another, something like iconv? I'm using Java version 1.3.


聽兲甴掵 2024-07-15 19:28:10

You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)

EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
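As a rough illustration of the String-based route, here is a minimal sketch that re-encodes bytes by decoding them into a String and encoding that String again. The class name `Transcode` and the helper `convert` are made up for this example. It deliberately uses the overloads that take a charset name, because `java.nio.charset.Charset` only exists from Java 1.4 and the `String` overloads that accept a `Charset` only from Java 6, so on Java 1.3 the name-based overloads are the ones available.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, named for this example only.
final class Transcode {
    /** Re-encodes bytes from one named charset to another by going through a String. */
    static byte[] convert(byte[] input, String fromCharset, String toCharset)
            throws UnsupportedEncodingException {
        String text = new String(input, fromCharset); // bytes -> UTF-16 chars
        return text.getBytes(toCharset);              // UTF-16 chars -> bytes
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "cafe" with an e-acute as the last letter, encoded in ISO-8859-1 (4 bytes)
        byte[] latin1 = { (byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9 };
        byte[] utf8 = convert(latin1, "ISO-8859-1", "UTF-8");
        System.out.println(utf8.length); // prints 5: the e-acute becomes C3 A9 in UTF-8
    }
}
```

Note that these String methods silently substitute a replacement character for anything they cannot decode or encode, so a typo-free charset name does not guarantee a lossless conversion.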

See "URL Encoding (or: 'What are those "%20" codes in URLs?')".

吖咩 2024-07-15 19:28:10

CharsetDecoder should be what you are looking for, no?

Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF-16BE (sixteen-bit UCS Transformation Format, big-endian byte order).

See Charset. That doesn't mean UTF-16 is the default charset (i.e. the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
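To see what this means on a given machine, the following sketch prints the JVM's default charset and checks that the six standard charsets listed above are available. The class name `DefaultCharsetDemo` and the helper `isAvailable` are invented for illustration. `Charset.defaultCharset()` requires Java 5 or later, so this is not something that would run on Java 1.3.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, named for this example only.
final class DefaultCharsetDemo {
    /** Returns true if the named charset is available in this JVM. */
    static boolean isAvailable(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // The result depends on the operating system, locale and JVM settings.
        System.out.println("Default charset: " + Charset.defaultCharset().name());

        String[] standard = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };
        for (String name : standard) {
            System.out.println(name + " supported: " + isAvailable(name));
        }
    }
}
```

Because the default varies between machines, code that converts between encodings is usually safer when it names both charsets explicitly instead of relying on the default.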

This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.

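import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Create the encoder and decoder for ISO-8859-1
        Charset charset = Charset.forName("ISO-8859-1");
        CharsetDecoder decoder = charset.newDecoder();
        CharsetEncoder encoder = charset.newEncoder();

        try {
            // Convert a string to ISO-8859-1 bytes in a ByteBuffer.
            // The new ByteBuffer is ready to be read.
            ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));

            // Convert the ISO-8859-1 bytes in the ByteBuffer to characters in a CharBuffer, then to a string.
            // The new CharBuffer is ready to be read.
            CharBuffer cbuf = decoder.decode(bbuf);
            String s = cbuf.toString();
            System.out.println(s);
        } catch (CharacterCodingException e) {
            // Thrown on malformed input or on characters the charset cannot represent
            e.printStackTrace();
        }
    }
}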
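The same two classes can be chained to convert bytes from one charset to another, which is closer to what iconv does. The sketch below is one possible way to do that, not the only one; `StrictTranscoder` and `transcode` are names made up for this example. It sets `CodingErrorAction.REPORT` explicitly so that malformed input or unmappable characters raise a `CharacterCodingException` instead of being replaced silently.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, named for this example only.
final class StrictTranscoder {
    /** Decodes input with the source charset and re-encodes it with the target charset. */
    static byte[] transcode(byte[] input, String from, String to) throws CharacterCodingException {
        CharBuffer chars = Charset.forName(from).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(input));
        ByteBuffer bytes = Charset.forName(to).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] out = new byte[bytes.remaining()];
        bytes.get(out); // copy only the encoded bytes, not the buffer's spare capacity
        return out;
    }

    public static void main(String[] args) {
        try {
            // e-acute is the single byte E9 in ISO-8859-1 and the two bytes C3 A9 in UTF-8
            byte[] utf8 = transcode(new byte[] { (byte) 0xE9 }, "ISO-8859-1", "UTF-8");
            System.out.println(utf8.length); // prints 2

            // The euro sign (E2 82 AC in UTF-8) does not exist in ISO-8859-1, so this throws
            transcode(new byte[] { (byte) 0xE2, (byte) 0x82, (byte) 0xAC }, "UTF-8", "ISO-8859-1");
        } catch (CharacterCodingException e) {
            System.out.println("Conversion rejected: " + e);
        }
    }
}
```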

平生欢 2024-07-15 19:28:10

I would just like to add that if the string was originally encoded using the wrong encoding, it may be impossible to convert it to another encoding without errors.
The question does not say that the conversion here is from a wrong encoding to the correct one, but I personally stumbled upon this question because of exactly that situation, so this is a heads-up for others as well.

This answer to another question explains why the conversion does not always yield correct results:
https://stackoverflow.com/a/2623793/4702806
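A small sketch of the problem, using made-up names (`WrongEncodingDemo`, `decode`) and two charsets chosen only for illustration: bytes that are not valid UTF-8 are replaced with U+FFFD when decoded as UTF-8, and the original byte cannot be recovered afterwards. The opposite mistake, reading UTF-8 bytes as ISO-8859-1, happens to be reversible because ISO-8859-1 assigns a character to every byte value.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, named for this example only.
final class WrongEncodingDemo {
    static String decode(byte[] bytes, String charsetName) throws UnsupportedEncodingException {
        return new String(bytes, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Lossy case: e-acute in ISO-8859-1 is the single byte E9, which is malformed as UTF-8.
        byte[] latin1 = { (byte) 0xE9 };
        String misread = decode(latin1, "UTF-8");     // contains U+FFFD, the replacement character
        byte[] back = misread.getBytes("ISO-8859-1"); // U+FFFD is unmappable, so '?' is written
        System.out.println(back[0] == (byte) '?');    // prints true: the original E9 is gone

        // Reversible case: UTF-8 bytes C3 A9 misread as ISO-8859-1 give two wrong characters,
        // but every byte survives, so the damage can be undone.
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 };
        String mojibake = decode(utf8, "ISO-8859-1");
        String repaired = decode(mojibake.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(repaired.equals("\u00E9")); // prints true
    }
}
```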

任谁 2024-07-15 19:28:10

It is a whole lot easier if you think of Unicode as a character set (which it actually is: essentially the numbered set of all known characters). You can encode it as UTF-8 (1 to 4 bytes per character, depending on the code point) or as UTF-16 (2 bytes per character, or 4 bytes using a surrogate pair).
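The byte counts can be checked directly. The sketch below encodes four sample code points in both encodings and prints the resulting lengths; `EncodedLengths` and `byteLength` are names invented for this example, and the sample code points are arbitrary. It needs Java 5 or later for `Character.toChars`.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, named for this example only.
final class EncodedLengths {
    /** Number of bytes needed to encode a single code point in the named charset. */
    static int byteLength(int codePoint, String charsetName) throws UnsupportedEncodingException {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(charsetName).length;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Latin capital A, e-acute, euro sign, and an emoji outside the 16-bit range
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " UTF-8=" + byteLength(cp, "UTF-8")
                    + " UTF-16BE=" + byteLength(cp, "UTF-16BE"));
        }
    }
}
```

For these four samples the UTF-8 lengths are 1, 2, 3 and 4 bytes, and the UTF-16BE lengths are 2, 2, 2 and 4 bytes.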

Back in the mists of time, Java used UCS-2 to encode the Unicode character set. UCS-2 could only handle 2 bytes per character and is now obsolete. Adding surrogate pairs and moving up to UTF-16 was a fairly obvious hack.
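The effect of surrogate pairs is visible in the String API: a character outside the 16-bit range occupies two `char` values, so `length()` and the number of code points disagree. A minimal sketch, with `SurrogateDemo` and `describe` as made-up names and U+1F600 as an arbitrary sample (Java 5 or later):

```java
// Hypothetical demo class, named for this example only.
final class SurrogateDemo {
    /** Formats the UTF-16 code units of one code point as space-separated hex values. */
    static String describe(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // prints 2: two UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // prints 1: one Unicode code point
        System.out.println(describe(0x1F600));               // prints D83D DE00
        System.out.println(describe(0x20AC));                // prints 20AC: no surrogates needed
    }
}
```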

A lot of people think they should have used UTF-8 in the first place. When Java was originally written, Unicode had far more than 65535 characters anyway...
