Adding non-ASCII file names to a zip in Java

Posted on 2024-07-06 10:53:28

What is the best way to add non-ASCII file names to a zip file using Java, in such a way that the files can be properly read in both Windows and Linux?

Here is one attempt, adapted from https://truezip.dev.java.net/tutorial-6.html#Example, which works in Windows Vista but fails in Ubuntu Hardy. In Hardy the file name is shown as abc-ЖДФ.txt in file-roller.

import java.io.IOException;
import java.io.PrintStream;

import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;

public class Main {

    public static void main(final String[] args) throws IOException {

        try {
            PrintStream ps = new PrintStream(new FileOutputStream(
                    "outer.zip/abc-åäö.txt"));
            try {
                ps.println("The characters åäö works here though.");
            } finally {
                ps.close();
            }
        } finally {
            File.umount();
        }
    }
}

Unlike java.util.zip, TrueZIP allows specifying the encoding used for the zip file. Here's another sample, this time explicitly specifying the encoding. None of IBM437, UTF-8, or ISO-8859-1 works in Linux. IBM437 works in Windows.

import java.io.IOException;

import de.schlichtherle.io.FileOutputStream;
import de.schlichtherle.util.zip.ZipEntry;
import de.schlichtherle.util.zip.ZipOutputStream;

public class Main {

    public static void main(final String[] args) throws IOException {

        for (String encoding : new String[] { "IBM437", "UTF-8", "ISO-8859-1" }) {
            ZipOutputStream zipOutput = new ZipOutputStream(
                    new FileOutputStream(encoding + "-example.zip"), encoding);
            ZipEntry entry = new ZipEntry("abc-åäö.txt");
            zipOutput.putNextEntry(entry);
            zipOutput.closeEntry();
            zipOutput.close();
        }
    }
}

Comments (7)

灰色世界里的红玫瑰 2024-07-13 10:53:29

Did it actually fail, or was it just a font issue (e.g. the font having different glyphs for those character codes)? I've seen similar issues in Windows where rendering "broke" because the font didn't support the charset, but the data was actually intact and correct.

书间行客 2024-07-13 10:53:29

Non-ASCII file names are not reliable across ZIP implementations and are best avoided. There is no provision for storing a charset setting in ZIP files; clients tend to guess with 'the current system codepage', which is unlikely to be what you want. Many combinations of client and codepage can result in inaccessible files.

Sorry!
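
To make the "guessing the codepage" failure concrete, here is a minimal sketch. It assumes (this is not confirmed anywhere in the thread) that the name was written as IBM437 bytes and that the reading tool decoded it as the Cyrillic DOS codepage IBM866. Under that assumption, the mismatch reproduces the abc-ЖДФ.txt name reported in the question. The class and method names are illustrative only, and both charsets must be available in the running JDK.

```java
import java.nio.charset.Charset;

public class CodepageGuessDemo {

    // Encode a name with one charset and decode the same bytes with another,
    // which is what happens when a ZIP reader guesses the wrong codepage.
    static String reinterpret(String name, String writtenAs, String readAs) {
        byte[] raw = name.getBytes(Charset.forName(writtenAs));
        return new String(raw, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        // \u00e5\u00e4\u00f6 are the characters åäö, escaped so the source
        // file encoding does not matter.
        String original = "abc-\u00e5\u00e4\u00f6.txt";
        System.out.println(reinterpret(original, "IBM437", "IBM866"));
    }
}
```

The bytes in the archive are identical in both cases; only the interpretation differs, which is why the same file can look right on one machine and wrong on another.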

奈何桥上唱咆哮 2024-07-13 10:53:28

The encoding for file entries in ZIP was originally specified as IBM Code Page 437. Many characters used in other languages cannot be represented that way.

The PKWARE specification acknowledges the problem and adds a flag bit (general purpose bit 11, the language encoding flag). But that is a later addition (from 2007, thanks to Cheeso for clearing that up, see comments). If that bit is set, the file name entry has to be encoded in UTF-8. This extension is described in 'APPENDIX D - Language Encoding (EFS)', at the end of the linked document.
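
As a rough illustration of that flag, the sketch below writes a one-entry archive in memory and inspects the general purpose bit flag of the first local file header (two little-endian bytes at offset 6). It relies on the java.util.zip constructors that take a Charset, which only exist from Java 7 onward. The class and method names are made up for this example, and whether a given writer sets the bit is up to that writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class EfsFlagCheck {

    // Bit 11 of the general purpose bit flag: names and comments are UTF-8.
    static final int EFS_BIT = 1 << 11;

    // Build a zip with a single empty entry, encoding its name with cs.
    static byte[] zipWithEntry(String name, Charset cs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf, cs)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Local file header layout: 4-byte signature, 2-byte version needed,
    // then the 2-byte general purpose bit flag (little-endian).
    static boolean hasUtf8Flag(byte[] zip) {
        int flags = (zip[6] & 0xFF) | ((zip[7] & 0xFF) << 8);
        return (flags & EFS_BIT) != 0;
    }

    public static void main(String[] args) {
        String name = "abc-\u00e5\u00e4\u00f6.txt"; // abc-åäö.txt
        System.out.println("UTF-8:      " + hasUtf8Flag(zipWithEntry(name, StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1: " + hasUtf8Flag(zipWithEntry(name, StandardCharsets.ISO_8859_1)));
    }
}
```

A reader that honours the bit can decode the name without guessing; a reader that ignores it falls back to whatever codepage it assumes.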

Java has a known bug that causes trouble with non-ASCII characters in entry names. See bug #4244499 and the large number of related bugs.

As a workaround, my colleague URL-encoded the file names before storing them in the ZIP and decoded them after reading. If you control both the storing and the reading side, that may work for you.

EDIT: In the bug report someone suggests using the ZipOutputStream from Apache Ant as a workaround. That implementation allows an encoding to be specified.
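
A minimal sketch of the URL-encoding workaround, using only the JDK. The helper names are hypothetical, and the approach only helps when the same convention is applied on the reading side; other tools will show the percent-encoded name. Note that URLEncoder also escapes "/" and turns spaces into "+", so names with directory components would need to be encoded segment by segment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UrlEncodedNames {

    // Turn an arbitrary name into a pure-ASCII entry name.
    static String encodeName(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Reverse of encodeName, applied after reading the entry.
    static String decodeName(String stored) {
        try {
            return URLDecoder.decode(stored, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Write one entry under its encoded name, read it back, and decode it.
    static String roundTrip(String name) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(buf)) {
                zip.putNextEntry(new ZipEntry(encodeName(name)));
                zip.closeEntry();
            }
            try (ZipInputStream in = new ZipInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                return decodeName(in.getNextEntry().getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "abc-\u00e5\u00e4\u00f6.txt"; // abc-åäö.txt
        System.out.println(encodeName(name));
        System.out.println(roundTrip(name).equals(name));
    }
}
```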

听你说爱我 2024-07-13 10:53:28

In Zip files, according to the spec owned by PKWare, the encoding of file names and file comments is IBM437. In 2007 PKWare extended the spec to also allow UTF-8. This says nothing about the encoding of the files contained within the zip, only the encoding of the file names.

I think all tools and libraries (Java and non Java) support IBM437 (which is a superset of ASCII), and fewer tools and libraries support UTF-8. Some tools and libs support other code pages. For example if you zip something using WinRar on a computer running in Shanghai, you will get the Big5 code page. This is not "allowed" by the zip spec but it happens anyway.

The DotNetZip library for .NET does Unicode, but of course that doesn't help you if you are using Java!

Using the Java built-in support for ZIP, you will always get IBM437. If you want an archive with something other than IBM437, then use a third party library, or create a JAR.

忆依然 2024-07-13 10:53:28

Miracles do happen, and Sun/Oracle really did fix the long-standing bug/RFE:

It is now possible to set the file name encoding when creating the zip file/stream (requires Java 7).
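
A short sketch of what that looks like with the Java 7+ API: both java.util.zip.ZipOutputStream and java.util.zip.ZipInputStream have constructors that accept a Charset. The class and method names below are illustrative, and the archive is kept in memory only to keep the example self-contained. Whether a third-party tool then displays the names correctly still depends on that tool.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class Utf8ZipNames {

    // Write a single text entry, encoding the entry name as UTF-8.
    static byte[] write(String entryName, String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf, StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(text.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // List entry names, decoding them as UTF-8.
    static List<String> listNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(
                new ByteArrayInputStream(zipBytes), StandardCharsets.UTF_8)) {
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String name = "abc-\u00e5\u00e4\u00f6.txt"; // abc-åäö.txt
        byte[] zip = write(name, "The characters work here.");
        System.out.println(listNames(zip).contains(name));
    }
}
```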

錯遇了你 2024-07-13 10:53:28

You can still use the Apache Commons implementation of the zip stream: http://commons.apache.org/compress/apidocs/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.html#setEncoding%28java.lang.String%29

Calling setEncoding("UTF-8") on your stream should be enough.

绮筵 2024-07-13 10:53:28

From a quick look at the TrueZIP manual - they recommend the JAR format:

It uses UTF-8 for file name encoding
and comments - unlike ZIP, which only
uses IBM437.

This probably means that the API is using the java.util.zip package for its implementation; that documentation states that it is still using a ZIP format from 1996. Unicode support wasn't added to the PKWARE .ZIP File Format Specification until 2006.
