Character encoding in CSV: UTF-8 and ISO-8859-1



Possible Duplicate:
How to add a UTF-8 BOM in java

My Oracle database has a character set of UTF8.
I have a Java stored procedure which fetches records from the table and creates a CSV file.

BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
retBLOB.open(BLOB.MODE_READWRITE);
OutputStream bOut = retBLOB.setBinaryStream(0L);
ZipOutputStream zipOut = new ZipOutputStream(bOut);
PrintStream out = new PrintStream(zipOut,false,"UTF-8");

The German characters (fetched from the table) become gibberish in the CSV if I use the above code. But if I change the encoding to ISO-8859-1, then I can see the German characters properly in the CSV file.

PrintStream out = new PrintStream(zipOut,false,"ISO-8859-1");

I have read in some posts that we should use UTF-8, as it is safe and will also encode other languages (Chinese etc.) properly, which ISO-8859-1 will fail to do.

Please suggest which encoding I should use. (There is a strong chance that we will store Chinese/Japanese words in the table in the future.)
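For reference, a minimal sketch of the pipeline the question describes, with the ZipEntry step made explicit (the snippet above omits it) and the output encoded as UTF-8. The class name, the entry name export.csv and the csvLines parameter are placeholders, not part of the original code:

import java.io.OutputStream;
import java.io.PrintStream;
import java.sql.Connection;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import oracle.sql.BLOB;

public class ZippedCsvSketch {

    // Writes the given CSV lines as a UTF-8 encoded file inside a zip held in a temporary BLOB.
    static BLOB writeZippedCsv(Connection conn, List<String> csvLines) throws Exception {
        BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
        retBLOB.open(BLOB.MODE_READWRITE);
        OutputStream bOut = retBLOB.setBinaryStream(0L);

        ZipOutputStream zipOut = new ZipOutputStream(bOut);
        zipOut.putNextEntry(new ZipEntry("export.csv"));   // a zip needs at least one entry

        PrintStream out = new PrintStream(zipOut, false, "UTF-8");
        for (String line : csvLines) {
            out.println(line);                             // each line is encoded as UTF-8 bytes
        }
        out.flush();

        zipOut.closeEntry();
        out.close();                                       // also closes zipOut and bOut
        retBLOB.close();
        return retBLOB;
    }
}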


Comments (3)

哆啦不做梦 2024-10-13 06:39:51


You're currently only talking about one part of a process that is inherently two-sided.

Encoding something to bytes is only really relevant in the sense that some other process comes along and decodes it back into text at some later point. And of course, both processes need to use the same character set, or else the decode will fail.

So it sounds to me like the process that takes the BLOB out of the database and into the CSV file is assuming that the bytes are an ISO-8859-1 encoding of text. Hence if you store them as UTF-8, the decoding gets mangled (though the basic ASCII characters have the same byte representation in both, which is why they still decode correctly).

UTF-8 is a good character set to use in almost all circumstances, but it's not magic enough to overcome the immutable law that the same character set must be used for decoding as was used for encoding. So you can either change your CSV creator to decode with UTF-8, or else you'll have to continue encoding with ISO-8859-1.
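The mismatch described above is easy to reproduce in isolation. Below is a small self-contained demo (the sample string and the class name are made up, not taken from the question) that encodes a German string as UTF-8 and then decodes the bytes once with the wrong charset and once with the right one:

import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        String original = "Grüße";   // German text with non-ASCII characters

        // Encode with UTF-8 (what the question's PrintStream does).
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes as ISO-8859-1 is exactly the mismatch described above:
        // the ASCII letters survive, the umlauts turn into mojibake.
        String wrongDecode = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        String rightDecode = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrongDecode);   // gibberish such as "GrÃ¼..."
        System.out.println(rightDecode);   // "Grüße"
    }
}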

冷弦 2024-10-13 06:39:51


I suppose your BLOB data is ISO-8859-1 encoded. As it's stored as binary and not as text, its encoding does not depend on the database's encoding. You should check whether the BLOB was originally written in UTF-8 encoding and, if not, do so.
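If the source text really does live in a BLOB column, as this answer assumes, one quick way to check its encoding is to decode the raw bytes with both candidate charsets and see which result shows the German characters intact. A minimal sketch (class and method names are illustrative only):

import java.nio.charset.StandardCharsets;
import java.sql.Blob;
import java.sql.SQLException;

public class BlobEncodingCheck {

    // Decode the raw BLOB bytes with both charsets; whichever output shows the
    // umlauts correctly is the charset the bytes were originally written with.
    static void printBothDecodings(Blob blob) throws SQLException {
        byte[] raw = blob.getBytes(1, (int) blob.length());   // Blob positions are 1-based

        System.out.println("UTF-8      : " + new String(raw, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 : " + new String(raw, StandardCharsets.ISO_8859_1));
    }
}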

我要还你自由 2024-10-13 06:39:51


I think the problem is that Excel cannot figure out the UTF-8 encoding of the CSV.
utf-8 csv issue

But I'm still not able to resolve the issue even if I put a BOM on the PrintStream.

PrintStream out = new PrintStream(zipOut,false,"UTF-8"); 
out.write('\ufeff');

I also tried:

out.write(new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF });

but to no avail.
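One thing worth noting: PrintStream.write(int) writes a single byte, so out.write('\ufeff') emits only the low byte of the BOM character, not the three-byte UTF-8 BOM. Writing the BOM character through a Writer configured for UTF-8 avoids that, and inside a zip the BOM has to be the very first bytes of the CSV entry, i.e. written right after putNextEntry. A runnable sketch along those lines (the file name, entry name and sample CSV line are placeholders):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class BomCsvSketch {
    public static void main(String[] args) throws Exception {
        ZipOutputStream zipOut = new ZipOutputStream(new FileOutputStream("export.zip"));
        zipOut.putNextEntry(new ZipEntry("export.csv"));    // the BOM must be the first bytes of this entry

        // A Writer configured for UTF-8 turns the single BOM character into the bytes EF BB BF.
        PrintWriter out = new PrintWriter(new OutputStreamWriter(zipOut, "UTF-8"));
        out.write('\uFEFF');
        out.println("Spalte;Grüße;Straße");                 // hypothetical CSV line with German characters
        out.flush();

        zipOut.closeEntry();
        zipOut.close();
    }
}

Whether Excel then opens the file correctly still depends on the Excel version and import settings, so this is only a sketch of the BOM part, not a guaranteed fix.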
