Open an xls file and save it as a tsv file using Java, and convert UTF-16LE to UTF-8

Posted on 2025-01-07 15:14:17

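I have two questions:

Is there a way to open an xls file in Java and save it as a tsv file? Edit: or is there a way to convert an xls file to a tsv file in Java?

Is there a way to convert a UTF-16LE file to UTF-8 using Java?

Thanks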

对不⑦ 2025-01-14 15:14:17

I've two questions:

On StackOverflow you should split that into two different questions...

I'll answer your second question:

Is there a way in which we can convert a UTF-16LE file to UTF-8 using Java?

Yes, of course. And there's more than one way.

Basically you want to read your input file specifying the input encoding (UTF-16LE) and then write the file specifying the output encoding (UTF-8).

Say you have some UTF-16LE encoded file:

... $ file testInput.txt 
testInput.txt: Little-endian UTF-16 Unicode character data

You could then do something like this in Java (it's just an example: you'll want to add proper exception handling, maybe avoid adding a trailing newline, maybe discard the BOM if there is one, etc.):

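    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    // Place this inside a method that declares "throws IOException".
    // Decode the input as UTF-16LE and re-encode it as UTF-8, line by line.
    // try-with-resources flushes and closes both streams, even on error.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
             new FileInputStream("/home/.../testInput.txt"), StandardCharsets.UTF_16LE));
         BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
             new FileOutputStream("/home/.../testOutput.txt"), StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            bw.write(line);
            // Writes the platform line separator; this also adds a newline after
            // the last line even if the input did not end with one, fix this if it matters.
            bw.newLine();
        }
    }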
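
Since there is more than one way to do this, here is a minimal sketch of an alternative that copies characters instead of lines, so the original line terminators are kept and no extra trailing newline is added. The class name `Utf16LeToUtf8` and the method `convert` are made up for illustration, and the sketch assumes Java 11 or later (`Reader.transferTo` needs Java 10, `Path.of` needs Java 11).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
public class Utf16LeToUtf8 {

    // Copies every character of a UTF-16LE file into a UTF-8 file unchanged.
    // A leading BOM (U+FEFF), if present, is copied like any other character.
    public static void convert(Path input, Path output) throws IOException {
        try (Reader in = Files.newBufferedReader(input, StandardCharsets.UTF_16LE);
             Writer out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            in.transferTo(out); // copies all chars, line terminators included
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Utf16LeToUtf8 <input file> <output file>
        convert(Path.of(args[0]), Path.of(args[1]));
    }
}
```

Unlike the `readLine()` loop, this leaves `\r\n` versus `\n` exactly as it was in the input, which may or may not be what you want.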

This creates a UTF-8 encoded file:

$ file testOutput.txt 
testOutput.txt: UTF-8 Unicode (with BOM) text

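Java's UTF-16LE decoder does not strip a leading BOM, so the BOM of the input file is carried over to the output. It can clearly be seen using, for example, hexdump: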

 $ hexdump testOutput.txt -C
00000000  ef bb bf ... (snip)

The BOM is encoded in three bytes in UTF-8 (ef bb bf), while it is encoded in two bytes in UTF-16. In UTF-16LE the BOM looks like this:

$ hexdump testInput.txt -C
00000000  ff fe ... (snip)

Note that a UTF-8 encoded file may or may not have a BOM (byte order mark); both are totally valid. A BOM in a UTF-8 file is not that silly: you don't care about the byte order, but it can help quickly identify a text file as being UTF-8 encoded. UTF-8 files with a BOM are fully legitimate according to the Unicode specs, and hence readers unable to deal with UTF-8 files starting with a BOM are broken. Plain and simple.
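
To illustrate how a BOM helps identify a file, here is a rough sketch that looks at the first bytes of a file and reports which of the BOMs shown above it starts with. `BomSniffer` and `detectBom` are hypothetical names, the sketch assumes Java 11 or later (for `InputStream.readNBytes(int)`), and it deliberately ignores UTF-32, whose little-endian BOM also starts with ff fe.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
public class BomSniffer {

    // Returns "UTF-8", "UTF-16LE" or "UTF-16BE", or null when the file
    // does not start with one of these BOMs. UTF-32 is not handled.
    public static String detectBom(Path file) throws IOException {
        byte[] head;
        try (InputStream in = Files.newInputStream(file)) {
            head = in.readNBytes(3); // may return fewer than 3 bytes
        }
        if (head.length >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    // Java bytes are signed; convert to the 0..255 range before comparing.
    private static int u(byte b) {
        return b & 0xFF;
    }
}
```

A missing BOM proves nothing about the encoding, so a result of `null` only means "no hint available".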

If for whatever reason you're working with broken UTF-8 readers unable to cope with BOMs, then you may want to remove the BOM from the first String before writing it to disk.
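
One possible way to drop the BOM during conversion is sketched below: peek at the first decoded character and skip it if it is U+FEFF. The class name `Utf16LeToUtf8NoBom` is invented for this example, and the sketch assumes Java 10 or later for `Reader.transferTo`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
public class Utf16LeToUtf8NoBom {

    private static final int BOM = 0xFEFF;

    // Converts a UTF-16LE file to UTF-8, dropping a leading BOM if there is one.
    public static void convert(Path input, Path output) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_16LE);
             Writer out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            in.mark(1);             // remember the start of the stream
            if (in.read() != BOM) {
                in.reset();         // first char is real content (or the file is empty)
            }
            in.transferTo(out);     // copy the rest unchanged
        }
    }
}
```

Only a BOM at the very start is removed; a U+FEFF that appears later in the text is left alone.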

More information on BOMs here:

http://unicode.org/faq/utf_bom.html
