A question about path-name encoding

Posted 2024-12-06 12:47:07

What have I done to get such a strange encoding in this path name?
In my file manager (Dolphin), the path name looks fine.

#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);

my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;

say Dump $s1;
say Dump $s2;

# SV = PV(0x824b50) at 0x9346d8
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
#   CUR = 4
#   LEN = 16

# SV = PV(0x7a7150) at 0x934c30
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
#   CUR = 8
#   LEN = 16

say $s1;
say $s2;

# Léo
# LakmÃ©

$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );

say $s1;
say $s2;

# L�o
# Lakmé


Comments (2)

恰似旧人归 2024-12-13 12:47:07

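Unfortunately, your operating system's pathname API is another "binary interface", where you have to use Encode::encode and Encode::decode to get predictable results.

Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as Latin-1, UTF-8 or some other character encoding is an application decision. Consequently, the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied) with the value returned by your OS via readdir(), and that's how you got code points mashed together with octets.

Rule of thumb: whenever you pass a path name to the OS, Encode::encode() it to make sure it is a sequence of octets. When you get a path name from the OS, Encode::decode() it into the character string your application wants.

You can make your program work by calling find this way: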

find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') );

And then call Encode::decode() when using the value of $File::Find::name:

my $path = Encode::decode('utf8', $File::Find::name);
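
Putting the two calls together, here is a minimal, self-contained sketch of the rule of thumb. It is only an illustration under some assumptions: the directory tree is created in a temporary location via File::Temp so the script can run anywhere, the file name Lakmé.txt is made up for the example, and the filesystem is assumed to store names as the exact octets it is given (as typical Linux filesystems do).

```perl
use strict;
use warnings;
use utf8;                          # this source file contains a literal é
use Encode qw(encode decode);
use File::Find;
use File::Temp qw(tempdir);

# Build a throwaway tree so the sketch is self-contained (names are made up).
my $root = tempdir( CLEANUP => 1 );

# Characters -> octets before anything is handed to the OS.
my $dir = "$root/" . encode( 'UTF-8', 'Delibes, Léo' );
mkdir $dir or die "mkdir: $!";
open my $fh, '>', "$dir/" . encode( 'UTF-8', 'Lakmé.txt' ) or die "open: $!";
close $fh;

# Octets go in; octets come back and are decoded exactly once.
my @names;
find( sub { push @names, decode( 'UTF-8', $File::Find::name ) if -f }, $dir );

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @names;
```

Because the directory is passed as octets, File::Find concatenates octets with octets, and a single decode at the end yields a clean character string.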

To be clearer, this is how $File::Find::name was formed:

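use 5.014;    # enables say (and strict)
use Encode;

# Force $dir to be stored internally as UTF-8 (UTF8 flag on):
# append a wide character, then chop it off again.
my $dir = 'L' . chr(233) . 'o' . chr(256);
chop $dir;

say "dir: ", d($dir);    # 3 characters: 4C E9 6F

# This is what readdir() returns: UTF-8 encoded octets.
my $leaf = encode('utf8', 'Lakm' . chr(233));

say "leaf: ", d($leaf);    # 6 octets: 4C 61 6B 6D C3 A9

# File::Find joins the two; the octets C3 A9 are now treated as characters.
$File::Find::name = $dir . '/' . $leaf;

say "File::Find::name: ", d($File::Find::name);

# Show each character of the string as a hex code point.
sub d {
  join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0]))
}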
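
As a follow-up to the demonstration, the sketch below contrasts the mixed string with one built consistently. It is an illustration only: the variable names and the hex_chars helper are hypothetical, and the strings stand in for what the question's script and readdir() would produce.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $dir_chars  = "L\x{e9}o";                         # characters, as in the script's literal
my $leaf_bytes = encode( 'UTF-8', "Lakm\x{e9}" );    # octets, as readdir() would return them

# Mixed: the octets C3 A9 end up being treated as the characters U+00C3 U+00A9.
my $mixed = $dir_chars . '/' . $leaf_bytes;

# Consistent: encode the directory first, join octets with octets, decode once.
my $fixed = decode( 'UTF-8', encode( 'UTF-8', $dir_chars ) . '/' . $leaf_bytes );

# Hypothetical helper: hex code point of every character in a string.
sub hex_chars { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

print 'mixed: ', hex_chars($mixed), "\n";    # 4C E9 6F 2F 4C 61 6B 6D C3 A9
print 'fixed: ', hex_chars($fixed), "\n";    # 4C E9 6F 2F 4C 61 6B 6D E9
```

In the mixed string the é of the leaf occupies two characters (C3 A9), which is what prints as "Ã©"; in the consistent string it is the single character E9.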

预谋 2024-12-13 12:47:07

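The POSIX filesystem API is broken because no encoding is enforced. Period.

Many problems can arise. For example, a single pathname can even contain both Latin-1 and UTF-8, depending on how (and whether) the various filesystems along the path handle encoding.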
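
One way an application might cope with such mixed pathnames is to decode each component separately. The sketch below is a heuristic, not a general solution: decode_component and decode_path are hypothetical helpers, the fallback to Latin-1 is an assumption about what the non-UTF-8 components contain, and a Latin-1 name whose octets happen to form valid UTF-8 would be misread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: decode one path component, preferring strict UTF-8
# and falling back to Latin-1, which accepts any octet sequence.
sub decode_component {
    my ($octets) = @_;
    my $chars = eval {
        decode( 'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return defined $chars ? $chars : decode( 'ISO-8859-1', $octets );
}

# Hypothetical helper: decode a whole path, one component at a time.
sub decode_path {
    my ($path) = @_;
    return join '/', map { decode_component($_) } split m{/}, $path, -1;
}

# A path whose first component is Latin-1 and whose second is UTF-8.
my $raw  = "L\xE9o/" . encode( 'UTF-8', "Lakm\x{e9}" );
my $path = decode_path($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$path\n";
```

Decoding per component keeps one badly encoded directory name from corrupting the rest of the path, at the cost of guessing the encoding.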
