File.listFiles() mangles Unicode names with JDK 6 (Unicode normalization issues)

Posted on 2024-09-16 21:02:33

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.

Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.

Here is a program to demonstrate. It creates a file with a Unicode name, then prints URL-encoded versions of the file name obtained from the directly created File and of the same file when listed under its parent directory (you should run this code in an empty directory). The results show the different encoding returned by the File.listFiles() method.

String fileName = "Trîcky Nåme";
File file = new File(fileName);
file.createNewFile();
System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

// Get parent (current) dir and list file contents
File parentDir = file.getAbsoluteFile().getParentFile();
File[] children = parentDir.listFiles();
for (File child: children) {
    System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
}

Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.

OS X Snow Leopard:

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on same OS X system):

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names.

Comments (6)

讽刺将军 2024-09-23 21:02:34

I suspect that you just have to tell javac which encoding to use when compiling the .java file, since you've hardcoded the special characters in the source file. Otherwise the platform default encoding will be used, which may not be UTF-8 at all.

You can use the javac option -encoding for this (it is a compiler option, not a VM argument).

javac -encoding UTF-8 com/example/Foo.java

This way the resulting .class file will end up containing the correct characters and you will be able to create and list the correct filename as well.

咆哮 2024-09-23 21:02:34

An alternative solution is to use the newer java.nio.file API (Path and Files, available since Java 7) in place of the java.io.File API; it works perfectly.
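For readers who want to try this, here is a minimal sketch of listing a directory with java.nio.file. The class name NioListing and the helper listNames are made up for illustration, and the sketch assumes Java 7 or later. Whether the returned names come back composed or decomposed still depends on the platform and file system, so normalizing before comparing is probably still prudent.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class NioListing {
    // Return the plain file names (last path element) of every entry in dir.
    static List<String> listNames(Path dir) throws IOException {
        List<String> names = new ArrayList<String>();
        // try-with-resources closes the directory stream when done
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path child : stream) {
                names.add(child.getFileName().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : listNames(Paths.get("."))) {
            System.out.println("Listed name: " + name);
        }
    }
}
```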

孤独岁月 2024-09-23 21:02:33

Using Unicode, there is more than one valid way to represent the same letter.
The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

You say "Note the %CC versus %C3 character representations", but looking closer what you see are the sequences

i 0xCC 0x82 vs. 0xC3 0xAE
a 0xCC 0x8A vs. 0xC3 0xA5

That is, the first is the letter i followed by 0xCC 0x82, which is the UTF-8 encoding of the Unicode character \u0302 "combining circumflex accent", while the second is the UTF-8 encoding of \u00EE "latin small letter i with circumflex". Similarly for the other pair: the first is the letter a followed by 0xCC 0x8A, the "combining ring above" character (\u030A), and the second is \u00E5 "latin small letter a with ring above". Both are valid UTF-8 encodings of valid Unicode character strings, but the first is in "decomposed" form and the second in "composed" form.
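To check these byte sequences yourself, a small sketch like the following dumps the UTF-8 bytes of both spellings. The class and method names (Utf8Forms, utf8Hex) are invented for this illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for inspecting UTF-8 bytes.
class Utf8Forms {
    // Render the UTF-8 bytes of a string as space-separated uppercase hex pairs.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00EE"));   // composed: C3 AE
        System.out.println(utf8Hex("i\u0302"));  // decomposed: 69 CC 82
        System.out.println(utf8Hex("\u00E5"));   // composed: C3 A5
        System.out.println(utf8Hex("a\u030A"));  // decomposed: 61 CC 8A
    }
}
```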

OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On a Unix file system, names are stored however the filesystem driver chooses to store them, so you can't make any blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

See Apple's Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

A recent email thread on Apple's java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
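As a rough illustration of that last point, the sketch below compares two names only after bringing both to the same form. NameCompare and sameName are hypothetical names, and NFC is an arbitrary choice here; any form works as long as both sides use the same one.

```java
import java.text.Normalizer;

// Hypothetical helper class for normalization-aware comparison.
class NameCompare {
    // Two names count as the same if their NFC forms are identical.
    static boolean sameName(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "Tr\u00EEcky N\u00E5me";
        String decomposed = "Tri\u0302cky Na\u030Ame";
        System.out.println(composed.equals(decomposed));    // false: different code units
        System.out.println(sameName(composed, decomposed)); // true: canonically equivalent
    }
}
```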

Saygoodbye 2024-09-23 21:02:33

Solution extracted from question:

Thanks to Stephen P for putting me on the right track.

The fix first, for the impatient. If you are compiling with Java 6 you can use the java.text.Normalizer class to normalize strings into a common form of your choice, e.g.

// Normalize to "Normalization Form Canonical Decomposition" (NFD)
protected String normalizeUnicode(String str) {
    Normalizer.Form form = Normalizer.Form.NFD;
    if (!Normalizer.isNormalized(str, form)) {
        return Normalizer.normalize(str, form);
    }
    return str;
}

Since java.text.Normalizer is only available in Java 6 and later, if you need to compile with Java 5 you might have to resort to the sun.text.Normalizer implementation and something like this reflection-based hack. See also: How does this normalize function work?

This alone is enough for me to decide I won't support compilation of my project with Java 5 :|

Here are other interesting things I learned in this sordid adventure.

  • The confusion is caused by the file names being in one of two normalization forms which cannot be directly compared: Normalization Form Canonical Decomposition (NFD) or Normalization Form Canonical Composition (NFC). The former tends to have ASCII letters followed by "modifiers" to add accents etc., while the latter has only the extended characters with no ASCII leading character. Read the wiki page Stephen P references for a better explanation.

  • Unicode string literals like the one contained in the example code (and those received via HTTP in my real app) are in the NFD form, while file names returned by the File.listFiles() method are NFC. The following mini-example demonstrates the differences:

    String name = "Trîcky Nåme";
    System.out.println("Original name: " + URLEncoder.encode(name, "UTF-8"));
    System.out.println("NFC Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFC), "UTF-8"));
    System.out.println("NFD Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFD), "UTF-8"));
    

    Output:

    Original name: Tri%CC%82cky+Na%CC%8Ame
    NFC Normalized name: Tr%C3%AEcky+N%C3%A5me
    NFD Normalized name: Tri%CC%82cky+Na%CC%8Ame
    
  • If you construct a File object with a string name, the File.getName() method will return the name in whatever form you gave it originally. However, if you call File methods that discover names on their own, they seem to return names in NFC form. This is potentially a nasty gotcha. It certainly got me.

  • According to the quote below from Apple's documentation, file names are stored in decomposed (NFD) form on the HFS Plus file system:

    When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode.

    So the File.listFiles() method helpfully (?) converts file names to the (pre)composed (NFC) form.
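Putting the points above together, one way to make a directory listing comparable with names from elsewhere is to normalize every listed name into a set and normalize each incoming name the same way before looking it up. This is only a sketch: ListedNameIndex, toNfc and listNfc are hypothetical names, the remote name in main is a stand-in value, and NFC is an arbitrary choice of common form.

```java
import java.io.File;
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not taken from the original post.
class ListedNameIndex {
    // Normalize every name to NFC so composed and decomposed spellings collapse to one entry.
    static Set<String> toNfc(String[] names) {
        Set<String> result = new HashSet<String>();
        for (String name : names) {
            result.add(Normalizer.normalize(name, Normalizer.Form.NFC));
        }
        return result;
    }

    // List a directory and return its entry names in NFC form.
    static Set<String> listNfc(File dir) {
        String[] names = dir.list(); // null if dir is not a readable directory
        return toNfc(names == null ? new String[0] : names);
    }

    public static void main(String[] args) {
        Set<String> local = listNfc(new File("."));
        // Stand-in for a name received from a remote store, in decomposed form.
        String remoteName = "Tri\u0302cky Na\u030Ame";
        boolean present = local.contains(Normalizer.normalize(remoteName, Normalizer.Form.NFC));
        System.out.println("Remote name present locally: " + present);
    }
}
```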

毅然前行 2024-09-23 21:02:33

I've seen something similar before. People who uploaded files from their Mac to a webapp used filenames with é.

a) On Mac OS that character is a plain e followed by a "sign for ´ applied to the previous character"

b) In Windows it's a special char: é

Both are Unicode. So... as I understand it, you pass option (b) when creating the File and at some point Mac OS converts it to option (a). Maybe if you search the internet for this double-representation issue you can find a way to handle both situations successfully.

Hope it helps!

玻璃人 2024-09-23 21:02:33

On a Unix file system, a file name really is a null-terminated byte[]. So the Java runtime has to perform a conversion from java.lang.String to byte[] during the createNewFile() operation. The char-to-byte conversion is governed by the locale. I've been testing with LC_ALL set to en_US.UTF-8 and en_US.ISO-8859-1 and got coherent results. This is with Sun (...Oracle) java 1.6.0_20. However, for LC_ALL=en_US.POSIX, the result is:

File name:   Tr%C3%AEcky+N%C3%A5me
Listed name: Tr%3Fcky+N%3Fme

3F is a question mark. It tells me that the conversion was not successful for the non-ASCII character. Then again, everything is as expected.
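The same kind of loss can be reproduced in plain Java, without touching the file system, by round-tripping the name through a charset that cannot represent the characters. This is only an analogy for what the runtime presumably does with file names under a POSIX locale, not the actual JVM code path; LossyEncoding and roundTrip are names invented for the sketch.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; an analogy, not the JVM's file name handling.
class LossyEncoding {
    // Encode to bytes and decode back with the same charset.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // ('?' for US-ASCII) for characters it cannot map.
    static String roundTrip(String s, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String name = "Tr\u00EEcky N\u00E5me";
        System.out.println(roundTrip(name, "UTF-8").equals(name));      // lossless
        System.out.println(roundTrip(name, "ISO-8859-1").equals(name)); // lossless for these characters
        System.out.println(roundTrip(name, "US-ASCII"));                // accented letters become '?'
    }
}
```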

But the reason your two strings are different is the equivalence between the \u00EE character (C3 AE in UTF-8) and the sequence i+\u0302 (69 CC 82 in UTF-8). \u0302 is a combining diacritical mark (combining circumflex accent). Some sort of normalization occurred during the file creation. I'm not sure if it's done in the Java runtime or the OS.

NOTE: It took me some time to figure this out, since the code snippet that you've posted does not have a combining diacritical mark but the equivalent character î (i.e. \u00ee). You should have embedded the Unicode escape sequence in the string literal (but it's easy to say that afterward...).
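For example, writing the test name with Unicode escapes makes the normalization form explicit and immune to copy-paste or editor conversions. The sketch below is one way to do it; ExplicitForms and its members are hypothetical names, not code from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class showing both spellings as explicit escapes.
class ExplicitForms {
    // Precomposed (NFC) spelling: one code point per accented letter.
    static final String COMPOSED = "Tr\u00EEcky N\u00E5me";
    // Decomposed (NFD) spelling: base letter followed by a combining mark.
    static final String DECOMPOSED = "Tri\u0302cky Na\u030Ame";

    // URL-encode as UTF-8, as the question's test program does.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(urlEncode(COMPOSED));   // Tr%C3%AEcky+N%C3%A5me
        System.out.println(urlEncode(DECOMPOSED)); // Tri%CC%82cky+Na%CC%8Ame
    }
}
```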
