Writing Unicode to an RTF file

Posted on 2024-12-12 04:29:06

I'm trying to write strings in different languages to an RTF file. I have tried a few different things.
I use Japanese here as an example, but it's the same for the other languages I have tried.

public void writeToFile(){

    String strJapanese = "日本語";
    DataOutputStream outStream;
    File file = new File("C:\\file.rtf");

    try{

        outStream = new DataOutputStream(new FileOutputStream(file));
        outStream.writeBytes(strJapanese);
        outStream.close();

    }catch (Exception e){
        System.out.println(e.toString());
    }
}

I have also tried:

byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);

Or, more specifically:

byte[] b = strJapanese.getBytes("Shift-JIS");
String output = new String(b);

The output stream also has the writeUTF method:

outStream.writeUTF(strJapanese);

You can also pass the byte[] directly to the output stream's write method. All of the above give me garbled characters for everything except Western European languages. To check whether it works, I opened the resulting document in Notepad++ and set the appropriate encoding. I also used OpenOffice, which lets you choose the encoding and font when opening the document.

If it does work but my computer can't open it properly, is there a way to check that?

Comments (3)

云仙小弟 2024-12-19 04:29:06

Strings in Java are Unicode (held in memory as UTF-16 code units), but when you want to write one out you need to specify the encoding:

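import java.io.*;
import java.nio.charset.StandardCharsets;

// str is the text to write, e.g. String str = "日本語";
// try-with-resources closes the writer even if write() throws
try (Writer out = new OutputStreamWriter(
        new FileOutputStream("test.txt"), StandardCharsets.UTF_8)) {
    out.write(str);
} catch (IOException e) {
    e.printStackTrace();
}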

ref: http://download.oracle.com/javase/tutorial/i18n/text/stream.html

赠我空喜 2024-12-19 04:29:06

You can write any Unicode character expressed as its decimal number by using the \u control word. E.g. \u1234? would represent the character whose Unicode code point is 1234 (decimal), and ? is the replacement character for cases where the character cannot be adequately represented (e.g. because the font doesn't contain it).

谁的新欢旧爱 2024-12-19 04:29:06

DataOutputStream outStream;

You probably don't want a DataOutputStream for writing an RTF file. DataOutputStream is for writing binary structures to a file, but RTF is text-based. Typically, an OutputStreamWriter with the appropriate charset set in the constructor is the way to write text files.

outStream.writeBytes(strJapanese);

In particular this fails because writeBytes really does write bytes, even though you pass it a String. A much more appropriate datatype would have been byte[], but that's just one of the places where Java's handling of bytes vs chars is confusing. The way it converts your string to bytes is simply by taking the lower eight bits of each UTF-16 code unit, and throwing the rest away. This results in ISO-8859-1 encoding with garbled nonsense for all the characters that don't exist in ISO-8859-1.
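The truncation described above can be seen with a small sketch. The class and method names here (WriteBytesDemo, viaWriteBytes) are made up for illustration, and the string is written with Java Unicode escapes only so the result does not depend on the source file's encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteBytesDemo {
    // Hypothetical helper: returns the bytes DataOutputStream.writeBytes emits for s.
    public static byte[] viaWriteBytes(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeBytes(s); // keeps only the low eight bits of each char
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // The three chars are U+65E5, U+672C, U+8A9E ("Japanese language" in Japanese).
        for (byte b : viaWriteBytes("\u65E5\u672C\u8A9E")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: E5 2C 9E
    }
}
```

Three characters go in and three bytes come out, each one just the low byte of the original code unit, so the high byte that distinguishes the Japanese characters is gone for good.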

byte[] b = strJapanese.getBytes("UTF-8");
String output = new String(b);

This doesn't really do anything useful. You encode to UTF-8 bytes and then decode them back to a String using the default charset. It's almost always a mistake to touch the default charset, as it is unpredictable across different machines.
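As a rough illustration of why the decode step matters, the sketch below decodes the same UTF-8 bytes with two explicitly named charsets instead of the platform default. The class and method names are hypothetical, and ISO-8859-1 is only a stand-in for whatever a given machine's default might be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Hypothetical helper: encode s as UTF-8, then decode the bytes with the given charset.
    public static String decodeUtf8BytesAs(String s, Charset decodeWith) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E";
        // Same charset on both sides: the original string comes back.
        System.out.println(japanese.equals(decodeUtf8BytesAs(japanese, StandardCharsets.UTF_8)));
        // Mismatched charset: each of the nine UTF-8 bytes becomes its own character.
        System.out.println(decodeUtf8BytesAs(japanese, StandardCharsets.ISO_8859_1).length());
    }
}
```

Calling new String(b) without a charset behaves like one of these two cases depending on the machine, which is the unpredictability the answer warns about.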

outStream.writeUTF(strJapanese);

This would be a better stab at writing UTF-8, but it's still not quite right as it uses Java's bogus “modified UTF-8” encoding, and more importantly RTF files don't actually support UTF-8, and shouldn't really directly include any non-ASCII characters at all.
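A small sketch can show two ways writeUTF output differs from plain UTF-8: it is prefixed with a two-byte length, and the NUL character is encoded as two bytes. The class and method names are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class WriteUtfDemo {
    // Hypothetical helper: returns the bytes DataOutputStream.writeUTF emits for s.
    public static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s); // two-byte length prefix, then modified UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E";
        // Plain UTF-8 needs 9 bytes for these three characters.
        System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);
        // writeUTF adds a 2-byte length in front, giving 11 bytes in total.
        System.out.println(viaWriteUTF(japanese).length);
        // NUL becomes the two bytes C0 80 instead of a single 00 byte.
        for (byte b : viaWriteUTF("\0")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Any reader that expects ordinary UTF-8 text will see the length prefix as stray bytes at the start of the file.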

Traditionally, non-ASCII characters from 128 upwards should be written as hex byte escapes like \'80, and the encoding for them is specified, if it is specified at all, in font \fcharset and \cpg escapes that are very, very annoying to deal with and don't offer UTF-8 as one of the options.

In more modern RTF, you get \u1234x escapes as in Dabbler's answer (+1). Each escape encodes one UTF-16 code unit, which corresponds to a Java char, so it's not too difficult to regex-replace all non-ASCII characters with their escaped variants.
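The escaping described above might look roughly like the sketch below. It uses a plain loop rather than a regex, and the class name, method names, output file name and the minimal RTF header (including the Arial font entry) are all assumptions made for illustration, not something taken from the question. One detail worth noting: the \u control word takes a signed 16-bit decimal, so code units above 32767 have to be written as negative numbers.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RtfUnicodeWriter {
    // Hypothetical helper: escapes text for an RTF body, one escape per UTF-16 code unit.
    // Line breaks are passed through untouched; a real writer would map them to RTF paragraphs.
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);   // RTF syntax characters need a backslash
            } else if (c < 128) {
                sb.append(c);                // plain ASCII is written as-is
            } else {
                // Casting to short turns values above 32767 into the negative
                // numbers RTF expects; '?' is the one-character fallback.
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    // Wraps escaped text in a minimal header (assumed here, not from the question).
    public static String document(String text) {
        return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\uc1 " + escape(text) + "}";
    }

    public static void main(String[] args) throws IOException {
        // Everything is ASCII after escaping, so US-ASCII is a safe file encoding.
        try (Writer out = Files.newBufferedWriter(Paths.get("file.rtf"), StandardCharsets.US_ASCII)) {
            out.write(document("\u65E5\u672C\u8A9E"));
        }
    }
}
```

For the question's Japanese string, the body comes out as \u26085?\u26412?\u-30050?, where the third value is negative because U+8A9E is above 32767. Characters outside the Basic Multilingual Plane are handled as two escapes, one per surrogate, since the loop works on Java chars.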

This is supported by Word 97 and later, but some other tools may ignore the Unicode escape and fall back to the x replacement character.

RTF is not a very nice format.
