What encoding are filenames in NTFS stored as?
I'm just getting started on some programming to handle files with non-English names on a WinXP system. I've done some recommended reading on Unicode and I think I get the basic idea, but some parts are still not very clear to me.
Specifically, what encoding (UTF-8, UTF-16LE/BE) are the file names (not the content, but the actual name of the file) stored in on NTFS? Is it possible to open any file using fopen(), which takes a char*, or do I have no choice but to use _wfopen(), which takes a wchar_t* and presumably expects a UTF-16 string?
I tried manually feeding a UTF-8 encoded string to fopen(), e.g.:
unsigned char filename[] = {0xEA, 0xB0, 0x80, 0x2E, 0x74, 0x78, 0x74, 0x0}; // 가.txt
FILE* f = fopen((char*)filename, "wb+");
but this came out as 'ê°€.txt'.
I was under the impression (which may be wrong) that a UTF-8-encoded string would suffice to open any filename under Windows, because I vaguely remember some Windows applications passing around char* rather than wchar_t* and having no problems.
Can anyone shed some light on this?
Comments (3)
NTFS stores filenames in UTF-16; however, fopen uses the ANSI code page (not UTF-8). To open a file by a UTF-16-encoded name you need the Unicode versions of the file-open calls. Define UNICODE and _UNICODE in your project so that CreateFile maps to its wide variant, then call CreateFile or _wfopen.
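To see why the UTF-8 bytes from the question showed up as 'ê°€.txt', it may help to decode them the way an ANSI API would. The sketch below assumes the machine's ANSI code page is Windows-1252 (an assumption inferred from the garbled output in the question; other locales would give different garbage). The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Windows-1252 mappings for bytes 0x80..0x9F; 0 marks an unassigned byte. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Interpret one byte as Windows-1252 and return its Unicode code point. */
uint16_t cp1252_to_unicode(unsigned char b) {
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_high[b - 0x80];
    return b; /* ASCII and 0xA0..0xFF coincide with Unicode code points */
}

/* Decode a NUL-terminated byte string as Windows-1252.
   Writes at most cap code points and returns how many were written. */
size_t decode_cp1252(const unsigned char *in, uint16_t *out, size_t cap) {
    size_t n = 0;
    assert(in != NULL && out != NULL);
    while (in[n] != 0 && n < cap) {
        out[n] = cp1252_to_unicode(in[n]);
        n++;
    }
    return n;
}
```

Fed the question's bytes 0xEA 0xB0 0x80, this yields U+00EA (ê), U+00B0 (°) and U+20AC (€): each byte of the three-byte UTF-8 sequence for 가 is treated as a separate ANSI character, which matches the filename that was actually created.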
fopen() in MSVC on Windows does not (by default) take a UTF-8 encoded char*.
Unfortunately, UTF-8 was invented rather recently in the grand scheme of things. Windows APIs are divided into Unicode and ANSI versions: every Windows API that takes or deals with strings is actually available with a W or an A suffix, W for "wide" character (Unicode) and A for ANSI. Macro magic hides all this from the developer, so you just call CreateFile with either a char* or a wchar_t*, depending on your build configuration, without seeing the difference.
The "ANSI" encoding is not actually one specific encoding; it means that the encoding used for char strings depends on the locale setting of the PC.
Now, because C runtime functions like fopen need to work by default without the developer knowing any of this, on Windows they expect to receive their strings in the system's locale (ANSI) encoding. MSDN indicates that the Microsoft C runtime function setlocale can change the current locale, but specifically says that it will fail for any code page that needs more than 2 bytes per character, such as UTF-8.
So, on Windows there is no shortcut. You need to use _wfopen, or the native API CreateFileW (or create your project with the Unicode build settings and just call CreateFile), with wchar_t* strings.
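As a rough illustration of the "no shortcut" point, here is a minimal sketch of a wrapper that accepts a UTF-8 path, converts it to UTF-16 and calls _wfopen. The names utf8_to_utf16 and fopen_utf8 are hypothetical helpers, not part of any library. The hand-written decoder does only minimal validation and is there so the sketch can be compiled off Windows too; in real Windows code MultiByteToWideChar with CP_UTF8 would normally do the conversion. The 260-unit buffer is an assumed path limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#ifdef _WIN32
#include <wchar.h>
#endif

/* Decode NUL-terminated UTF-8 into NUL-terminated UTF-16 code units.
   Returns the number of units written (excluding the terminator), or -1
   on malformed input or insufficient space. Minimal validation only:
   overlong forms and encoded surrogates are not rejected. */
int utf8_to_utf16(const unsigned char *in, uint16_t *out, size_t cap) {
    size_t n = 0;
    assert(in != NULL && out != NULL);
    while (*in) {
        uint32_t cp;
        int extra;
        if (*in < 0x80)                { cp = *in;        extra = 0; }
        else if ((*in & 0xE0) == 0xC0) { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else if ((*in & 0xF8) == 0xF0) { cp = *in & 0x07; extra = 3; }
        else return -1;                /* stray continuation or invalid lead byte */
        in++;
        while (extra-- > 0) {
            if ((*in & 0xC0) != 0x80) return -1; /* also catches truncation */
            cp = (cp << 6) | (uint32_t)(*in++ & 0x3F);
        }
        if (cp >= 0x10000) {           /* needs a surrogate pair */
            if (cp > 0x10FFFF || n + 2 >= cap) return -1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= cap) return -1;
            out[n++] = (uint16_t)cp;
        }
    }
    if (n >= cap) return -1;
    out[n] = 0;
    return (int)n;
}

/* Open a file whose path is given in UTF-8. */
FILE *fopen_utf8(const char *utf8_path, const char *mode) {
#ifdef _WIN32
    uint16_t wpath[260], wmode[16];
    if (utf8_to_utf16((const unsigned char *)utf8_path, wpath, 260) < 0) return NULL;
    if (utf8_to_utf16((const unsigned char *)mode, wmode, 16) < 0) return NULL;
    /* wchar_t is a 16-bit type on Windows, so the buffers can be passed as-is. */
    return _wfopen((const wchar_t *)wpath, (const wchar_t *)wmode);
#else
    return fopen(utf8_path, mode);     /* non-Windows fallback: pass bytes through */
#endif
}
```

With the byte array from the question, utf8_to_utf16 produces the single code unit 0xAC00 (가) followed by ".txt", which is the form _wfopen expects.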
As answered by others, the best way to handle UTF-8-encoded strings is to convert them to UTF-16 and use native Unicode APIs such as _wfopen or CreateFileW.
However, this approach won't help when calling into libraries that use fopen() unconditionally, either because they do not support Unicode or because they are written in portable C. In that case it is still possible to use the legacy "short paths" to convert a UTF-8-encoded string into an ASCII form usable with fopen, but it requires some legwork:
1. Convert the UTF-8 representation to UTF-16 using MultiByteToWideChar.
2. Use GetShortPathNameW to obtain a "short path", which is ASCII-only. GetShortPathNameW returns it as a wide string with all-ASCII content, which you then convert to a narrow string by a lossless copy, casting each wchar_t to char.
3. Pass the short path to fopen() or to the code that will eventually use fopen(). Be aware that error messages printed by that code, if any, will refer to the unsightly short path (e.g. KINTO~1 instead of kinto-un-筋斗雲).
While this is not exactly a recommended long-term strategy, as Windows short paths are a legacy feature that can be turned off per volume, it is likely the only way to pass file names to code that uses fopen() and other file-related API calls (stat, access, the ANSI versions of CreateFile, and similar).