A question about pathname encoding
What have I done to get such a strange encoding in this path-name?
In my file manager (Dolphin) the path-name looks good.
#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);
my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;
say Dump $s1;
say Dump $s2;
# SV = PV(0x824b50) at 0x9346d8
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
# CUR = 4
# LEN = 16
# SV = PV(0x7a7150) at 0x934c30
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
# CUR = 8
# LEN = 16
say $s1;
say $s2;
# Léo
# Lakmé
$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );
say $s1;
say $s2;
# L�o
# Lakmé
Comments (2)
Unfortunately, your operating system's pathname API is another "binary interface", where you have to use Encode::encode and Encode::decode to get predictable results.

Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as Latin-1, UTF-8 or some other character encoding is an application decision. Consequently, the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied) with the value returned by your OS via readdir(), and that's how you got code points mashed with octets.

Rule of thumb: whenever you pass a path name to the OS, Encode::encode() it to make sure it is a sequence of octets. When you get a path name from the OS, Encode::decode() it to the character set that your application wants it in.

You can make your program work by running the directory name through Encode::encode() before passing it to find, and then calling Encode::decode() on the value of $File::Find::name when you use it.

To be more clear, this is how $File::Find::name was formed: the directory name you supplied was a character string (because of "use utf8"), the leaf name from readdir() was a string of UTF-8 octets, and when Perl joined the two it treated each octet of the leaf as a separate Latin-1 character. That is why Dump shows Léo as "L\x{e9}o" but Lakmé as "Lakm\x{c3}\x{a9}".
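The encode-on-the-way-in, decode-on-the-way-out rule can be tried out with something like the sketch below. It is not the code from the original answer: the temporary directory, the variable names ($root, $dir_octets, @mixed, @decoded) and the file name Lakmé.mp3 are made up for illustration, and it assumes a filesystem that stores names as exactly the octets it is given (typical on Linux).

```perl
use strict;
use warnings;
use utf8;                        # literals below are character strings
use Encode qw(encode decode);
use File::Find;
use File::Temp qw(tempdir);

# Throwaway tree so the sketch is runnable: <tmp>/Delibes, Léo/Lakmé.mp3
# Assumption: the filesystem stores names as the exact octets it is given.
my $root       = tempdir( CLEANUP => 1 );
my $dir_octets = "$root/" . encode( 'UTF-8', 'Delibes, Léo' );
mkdir $dir_octets or die "mkdir failed: $!";
open my $fh, '>', "$dir_octets/" . encode( 'UTF-8', 'Lakmé.mp3' )
    or die "open failed: $!";
close $fh or die "close failed: $!";

my ( @mixed, @decoded );
find(
    sub {
        return unless -f $_;     # skip the directory itself
        my $leaf = $_;           # octets, as delivered by readdir()

        # The problem from the question: a character-string directory
        # joined with an octet leaf. Perl upgrades each octet of the
        # leaf to its own Latin-1 character.
        push @mixed, 'Delibes, Léo' . '/' . $leaf;

        # The fix: keep everything as octets, then decode exactly once.
        my $relative = substr $File::Find::name, length($root) + 1;
        push @decoded, decode( 'UTF-8', $relative );
    },
    $dir_octets,                 # octets in, so find builds octet paths
);

binmode STDOUT, ':encoding(UTF-8)';
print "mixed:   $mixed[0]\n";
print "decoded: $decoded[0]\n";
```

On such a filesystem the "mixed" line should show the same doubly encoded file name as in the question, while the "decoded" line shows the name as intended. The point of the design is that there is a single decode at the boundary, rather than patching up individual path components afterwards.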
The POSIX filesystem API is broken because no encoding is enforced. Period.
Many problems can arise. For example, a single pathname can contain both Latin-1 and UTF-8, depending on how the various filesystems along the path handle encoding (if they handle it at all).
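If you have to cope with such mixed paths, one possible approach is to decode each component separately: try strict UTF-8 first and fall back to Latin-1 when that fails. The sketch below is only a heuristic under that assumption. The helper names decode_component and decode_path are hypothetical, and a Latin-1 name that happens to be valid UTF-8 will be misread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode one path component. Strict UTF-8 is tried
# first; ISO-8859-1 is the fallback because it accepts any octet sequence.
sub decode_component {
    my ($octets) = @_;
    my $chars = eval {
        # LEAVE_SRC keeps decode() from modifying $octets in place
        # when a CHECK value such as FB_CROAK is given.
        Encode::decode( 'UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return defined $chars ? $chars : Encode::decode( 'ISO-8859-1', $octets );
}

# Hypothetical helper: decode a whole path, one component at a time.
sub decode_path {
    my ($path) = @_;
    return join '/', map { decode_component($_) } split m{/}, $path, -1;
}

# A directory name stored as Latin-1 followed by a file name stored as UTF-8.
my $mixed_octets = "L\xE9o/Lakm\xC3\xA9";
my $decoded      = decode_path($mixed_octets);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```

Decoding per component matters here because a single decode over the whole path would croak (or insert replacement characters) as soon as it reached the Latin-1 part.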