Perl's YAML::XS and Unicode

Posted on 2024-11-16 06:36:54

I am trying to use Perl's YAML::XS module with Unicode letters, and it doesn't seem to work the way it should.

I wrote this in a script (saved as UTF-8):

use utf8;
binmode STDOUT, ":utf8"; 
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159

use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;

Instead of something sane, -: Å is printed. According to this link, though, it should work fine.

Yes, when I YAML::XS::Load it back, I get the correct strings again, but I don't like that the dumped string seems to be in the wrong encoding.

Am I doing something wrong? To be frank, I am never quite sure about Unicode in Perl.

Clarification: my console supports UTF-8. Also, when I print to a file opened with a utf8 handle (open $file, ">:utf8") instead of STDOUT, it still doesn't print correct UTF-8 letters.

Comments (3)

吐个泡泡 2024-11-23 06:36:54

Yes, you're doing something wrong. You've misunderstood what the link you mentioned means. Dump & Load work with raw UTF-8 bytes; i.e. strings containing UTF-8 but with the UTF-8 flag off.

When you print those bytes to a filehandle with the :utf8 layer, they get interpreted as Latin-1 and converted to UTF-8, producing double-encoded output (which can be read back successfully as long as you double-decode it). You want to binmode STDOUT, ':raw' instead.
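The double encoding described above can be reproduced with core Perl alone. This is a minimal sketch: encode() from the core Encode module stands in for both YAML::XS::Dump (which returns UTF-8 bytes) and the :utf8 output layer (which encodes whatever it is given once more), and hex_bytes is a hypothetical helper introduced only for display.

```perl
use strict;
use warnings;
use utf8;                  # the source file itself is UTF-8
use Encode qw(encode);

# Hypothetical helper: show each character of a string as a two-digit hex value.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

my $char   = "ř";                       # one character, U+0159
my $bytes  = encode('UTF-8', $char);    # raw UTF-8 bytes, like the string Dump returns
my $double = encode('UTF-8', $bytes);   # the bytes treated as Latin-1 characters and encoded again

print hex_bytes($bytes),  "\n";   # C5 99
print hex_bytes($double), "\n";   # C3 85 C2 99
```

The sequence C3 85 is the UTF-8 encoding of Å (U+00C5), which is why an Å shows up in the question's output in place of ř.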

Another option is to call utf8::decode on the string returned by Dump. This will convert the raw UTF-8 bytes to a character string (with the UTF-8 flag on). You can then print the string to a :utf8 filehandle.

So, either

use utf8;
binmode STDOUT, ":raw"; 
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159

use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;

Or

use utf8;
binmode STDOUT, ":utf8"; 
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159

use YAML::XS;
my $s = YAML::XS::Dump($hash);
utf8::decode($s);
print $s;

Likewise, when reading from a file, you want to read in :raw mode or use utf8::encode on the string before passing it to Load.
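For the reading direction, a minimal sketch could look like this. It assumes YAML::XS is installed and that the YAML text is already held as a character string (for example, because it was read through a :utf8 layer); the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;
use YAML::XS ();

# A character string, as you would get from a handle with a :utf8 layer.
my $yaml_chars = "č: ř\n";

# Load expects raw UTF-8 bytes, so encode a copy in place first.
my $yaml_bytes = $yaml_chars;
utf8::encode($yaml_bytes);

my $data = YAML::XS::Load($yaml_bytes);

# The loaded keys and values are character strings again.
binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $data->{$_}\n" for sort keys %$data;
```

If the file is read with a :raw layer instead, the utf8::encode step is not needed, because the string already holds the raw bytes.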

When possible, you should just use DumpFile & LoadFile, letting YAML::XS deal with opening the file correctly. But if you want to use STDIN/STDOUT, you'll have to deal with Dump & Load.
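A round trip through DumpFile and LoadFile might look like the sketch below. It assumes YAML::XS is installed; the temporary file created with the core File::Temp module is used only to keep the example self-contained.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);
use YAML::XS qw(DumpFile LoadFile);   # both must be imported explicitly

# Create a temporary file path that is removed when the script exits.
my ($fh, $path) = tempfile(SUFFIX => '.yaml', UNLINK => 1);
close $fh;

# YAML::XS opens the file itself, so no I/O layers have to be chosen here.
DumpFile($path, { 'č' => 'ř' });
my $data = LoadFile($path);

binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $data->{$_}\n" for sort keys %$data;
```

No binmode or utf8::decode call is needed for the file itself; the layer on STDOUT is only there because the loaded values are character strings.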

夜还是长夜 2024-11-23 06:36:54

It works if you don't use binmode STDOUT, ":utf8";. Just don't ask me why.

锦爱 2024-11-23 06:36:54

I use the following for UTF-8 JSON and YAML. There is no error handling, but it shows how it can be done.
It gives me the following:

  • NFC normalisation on input and no NFD on output; everything is simply kept in NFC
  • the YAML/JSON files can be edited with UTF-8-enabled vim and bash tools
  • inside Perl, things like \w regexes, lc, uc and so on work (at least for my needs)
  • the source code is UTF-8, so I can write regexes such as /á/
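To illustrate why the input is normalised to NFC, here is a small sketch using the core Unicode::Normalize module. The two spellings of á are written as escapes so that the example does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E1}";     # á as a single code point (the NFC form)
my $decomposed = "a\x{301}";   # a followed by a combining acute accent (the NFD form)

# The two strings look identical on screen but compare as different...
print "equal before NFC: ", ($composed eq $decomposed ? "yes" : "no"), "\n";       # no

# ...until both are brought to the same normalisation form.
print "equal after NFC:  ", (NFC($decomposed) eq $composed ? "yes" : "no"), "\n";  # yes
print "length in NFD:    ", length(NFD($composed)), "\n";                          # 2
```

Keeping everything in one form means that hash keys, regexes such as /á/ and file names match regardless of which form an editor or tool produced.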

My boilerplate:

use 5.014;
use warnings;

use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);

use File::Slurp;
use YAML::XS;
use JSON::XS;

run();
exit;

sub run {
    my $yfilein = "./in.yaml"; #input yaml
    my $jfilein = "./in.json"; #input json
    my $yfileout = "./out.yaml"; #output yaml
    my $jfileout = "./out.json"; #output json

    my $ydata = load_utf8_yaml($yfilein);
    my $jdata = load_utf8_json($jfilein);

    #the "uc" is not "fully correct" but works for my needs
    $ydata->{$_} = uc($ydata->{$_}) for keys %$ydata;
    $jdata->{$_} = uc($jdata->{$_}) for keys %$jdata;

    save_utf8_yaml($yfileout, $ydata);
    save_utf8_json($jfileout, $jdata);
}


#File::Slurp is used to read and write the files
#NFC is applied on input only; nothing is converted to NFD on output (change this if you want)
#this ensures that I can edit the files and copy/paste file names without problems

sub load_utf8_yaml { return YAML::XS::Load(encode_nfc_read(shift)) }
sub load_utf8_json { return decode_json(encode_nfc_read(shift)) }
sub encode_nfc_read { return encode 'utf8', NFC read_file shift, { binmode => ':utf8' } }
#more efficient: write the raw bytes from Dump without decoding them first
sub rawsave_utf8_yaml { return write_file shift, {binmode=>':raw'}, YAML::XS::Dump shift }
#same approach as for JSON: decode to characters, then write through a :utf8 layer
sub save_utf8_yaml { return write_file shift, {binmode=>':utf8'}, decode 'utf8', YAML::XS::Dump shift }
sub save_utf8_json { return write_file shift, {binmode=>':utf8'}, JSON::XS->new->pretty(1)->encode(shift) }

You can try it with the following in.yaml:

---
á: ä
č: ď
é: ě
í: ĺ
ľ: ň
ó: ô
ö: ő
ŕ: ř
š: ť
ú: ů
ü: ű
ý: ž