Java can't open a file whose name contains Unicode surrogate values?

Posted on 2024-08-07 11:57:12

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:

"草𦿶鷗外.gif" which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif

If I create a File from this filename, I can't open it because I get a FileNotFoundException. Even running this on the folder containing the file fails:

File[] files = folder.listFiles(); 
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); //Fails on the surrogate filename
    }
}

Most of the code I am actually dealing with is of the form:

FileInputStream instream = new FileInputStream(new File("草𦿶鷗外.gif"));
// operations follow

Is there some way I can address this problem, either escaping the filenames or opening files differently?

Comments (4)

不喜欢何必死缠烂打 2024-08-14 11:57:13

It's a bug in the old-skool java.io.File API, maybe just on a Mac? Anyway, the newer java.nio API works much better. I have several files whose names and content contain Unicode characters, and they failed to load using java.io.File and related classes. After converting all my code to use java.nio.file.Path, EVERYTHING started working. I also replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.file.Files...

...and be sure to read and write the content of file using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
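A minimal sketch of that switch, assuming Java 7 or later (java.nio.file does not exist on the Java 1.5 from the question). The class name NioOpenSketch and the helper readAll are hypothetical names for illustration, and whether this actually sidesteps the problem on a particular Mac JVM is something to verify rather than something the sketch guarantees. The file name is written with the escapes from the question so the source compiles regardless of its encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class NioOpenSketch {
    /** Opens folder/fileName through java.nio.file (not FileInputStream) and returns its bytes. */
    static byte[] readAll(Path folder, String fileName) throws IOException {
        Path path = folder.resolve(fileName);
        try (InputStream in = Files.newInputStream(path)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Folder is taken from the command line (assumption); defaults to the working directory.
        Path folder = Paths.get(args.length > 0 ? args[0] : ".");
        // The test file name from the question, including the surrogate pair D85B DFF6.
        String name = "\u8349\uD85B\uDFF6\u9DD7\u5916.gif";
        boolean found = Files.exists(folder.resolve(name));
        System.out.println("exists: " + found);
        if (found) {
            System.out.println("bytes read: " + readAll(folder, name).length);
        }
    }
}
```

Code that currently does new FileInputStream(new File(name)) would call Files.newInputStream(path) instead, as readAll does here.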

爱格式化 2024-08-14 11:57:12

I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.

“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

\xF0\xA6\xBF\xB6

it outputs a UTF-8-encoded sequence for each of the surrogates:

\xED\xA1\x9B\xED\xBF\xB6

This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. Problem is if you round-trip that through a real UTF-8 encoder you've got a different string, the four-byte one above. Try to access the file with that name and boom! fail.
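The two byte sequences above can be reproduced from Java itself. The sketch below assumes Java 7 or later; the class and method names are hypothetical. It encodes U+26FF6 once with the standard UTF-8 charset and once with DataOutputStream.writeUTF, which is documented to write modified UTF-8 behind a two-byte length prefix (stripped here). It says nothing about which encoding the Mac JVM uses for filenames; it only shows how the two forms differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8VersusModifiedUtf8 {
    /** Formats bytes as two-digit uppercase hex values separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Standard UTF-8: one four-byte sequence per supplementary character. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 as written by writeUTF: each surrogate is encoded separately. */
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] withLength = buffer.toByteArray();
        // Drop the two-byte length prefix that writeUTF puts in front of the data.
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) throws IOException {
        String ch = new String(Character.toChars(0x26FF6));
        System.out.println(hex(standardUtf8(ch))); // F0 A6 BF B6
        System.out.println(hex(modifiedUtf8(ch))); // ED A1 9B ED BF B6
    }
}
```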

So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:

$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')

On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:

['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the ‘proper’ version?:

os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
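From the Java side, a rough counterpart to the Python check is to dump the UTF-16 code units of every name that File.listFiles() returns and see whether the surrogate pair D85B DFF6 survived decoding. This is only a sketch: FileNameDump and its helpers are hypothetical names, and the output shows what Java decoded, not the raw bytes on disk, so it complements the Python listing rather than replacing it.

```java
import java.io.File;

final class FileNameDump {
    /** Formats each UTF-16 code unit of name as four hex digits, separated by spaces. */
    static String codeUnits(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) name.charAt(i)));
        }
        return sb.toString();
    }

    /** True if name contains at least one properly paired surrogate (a non-BMP code point). */
    static boolean hasSupplementary(String name) {
        return name.codePointCount(0, name.length()) < name.length();
    }

    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : ".");
        File[] files = folder.listFiles();
        if (files == null) {
            System.out.println("Not a readable directory: " + folder);
            return;
        }
        for (File file : files) {
            String name = file.getName();
            System.out.println(codeUnits(name)
                    + " | supplementary=" + hasSupplementary(name)
                    + " | exists=" + file.exists());
        }
    }
}
```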

夜访吸血鬼 2024-08-14 11:57:12

If your environment's default locale does not include those characters you cannot open the file.

See: File.exists() fails with unicode characters in name

Edit:
Alright.. What you need is to change the system locale. Whatever OS you are using.

Edit:

See: How can I open files containing accents in Java?

See: JFileChooser on Mac cannot see files named by Chinese chars?
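To see what locale and encodings the JVM actually picked up before changing any system setting, something like the following can be run. It is a diagnostic sketch only: EncodingReport is a hypothetical class name, and sun.jnu.encoding is an implementation-specific property of Sun/Oracle/OpenJDK builds, so it may print null on other JVMs and should not be relied on in production code.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class EncodingReport {
    /** Collects the locale and encoding settings the running JVM reports. */
    static String describe() {
        return "default locale = " + Locale.getDefault() + "\n"
                + "default charset = " + Charset.defaultCharset() + "\n"
                + "file.encoding = " + System.getProperty("file.encoding") + "\n"
                // Implementation-specific; may be null on non-HotSpot JVMs (assumption).
                + "sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```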

风渺 2024-08-14 11:57:12

This turned out to be a problem with the Mac JVM (tested on 1.5 and 1.6). Filenames containing supplementary characters / surrogate pairs cannot be accessed with the Java File class. I ended up writing a JNI library with Carbon calls for the Mac version of the project (ick). I suspect the CESU-8 issue bobince mentioned, as the JNI call to get UTF-8 characters returned a CESU-8 string. Doesn't look like it's something you can really get around.
