How do I convert UTF-16 surrogate pairs to UTF-8 using Perl's pack?

Posted on 2025-02-06 08:53:59

I have input strings in which some characters are written as UTF-16 code units escaped with '\u'. I am trying to convert these strings to UTF-8 in Perl. For example, the string 'Alice & Bob & Carol' might be formatted in the input as:

'Alice \u0026 Bob \u0026 Carol'

To do my desired conversion, I was doing...:

$str =~ s/\\u([A-Fa-f0-9]{4})/pack("U", hex($1))/eg;

...which worked fine until I got to input strings that contained UTF-16 surrogate pairs like:

'Alice \ud83d\ude06 Bob'

How do I modify the above code that uses pack to work with UTF-16 surrogate pairs? I would really like a solution that just uses pack without having to use any additional libraries (JSON::XS, Encode, etc.).

Comments (1)

快乐很简单 2025-02-13 08:54:00

pack/unpack have no knowledge of UTF-16 text, only UTF-8 (and UTF-EBCDIC). Since you don't want to use a module, you have to decode the surrogate pairs manually.

#!/usr/bin/env perl
use strict;
use warnings;
use open qw/:locale/;
use feature qw/say/;

my $str = 'Alice \ud83d\ude06 Bob \u0026 Carol';

# Convert surrogate pairs encoded as two \uXXXX sequences
# Only match valid surrogate pairs so adjacent non-pairs aren't counted as one
$str =~ s/\\u((?i)D[89AB]\p{AHex}{2}) # High surrogate in range 0xD800–0xDBFF
          \\u((?i)D[CDEF]\p{AHex}{2}) #  Low surrogate in range 0xDC00–0xDFFF
         /chr( ((hex($1) - 0xD800) * 0x400) + (hex($2) - 0xDC00) + 0x10000 )/xge;
# Convert single \uXXXX sequences
$str =~ s/\\u(\p{AHex}{4})/chr hex $1/ge;

say $str;

outputs

Alice 😆 Bob & Carol
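
Since the question asks specifically for pack, the same decoding can also be written with pack("U", ...) in place of chr. The following is only a minimal sketch under a few assumptions: the helper name decode_u_escapes is hypothetical, both kinds of escape are handled in a single substitution pass rather than two, and STDOUT is switched to UTF-8 with binmode instead of `use open`. Only the surrogate arithmetic is taken from the code above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative, not from the answer above):
# decode \uXXXX escapes, combining surrogate pairs into one code point,
# and build each resulting character with pack("U", ...).
sub decode_u_escapes {
    my ($str) = @_;
    $str =~ s{
        \\u([Dd][89ABab][0-9A-Fa-f]{2})    # high surrogate, 0xD800-0xDBFF
        \\u([Dd][C-Fc-f][0-9A-Fa-f]{2})    # low surrogate,  0xDC00-0xDFFF
      |
        \\u([0-9A-Fa-f]{4})                # any other single escape
    }{
        defined $3
            ? pack("U", hex $3)
            # Example: D83D DE06 gives 0x10000 + (0x3D << 10) + 0x206 = 0x1F606
            : pack("U", 0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
    }gex;
    return $str;
}

binmode(STDOUT, ':utf8');    # write characters to STDOUT as UTF-8 bytes
print decode_u_escapes('Alice \ud83d\ude06 Bob \u0026 Carol'), "\n";
# prints: Alice 😆 Bob & Carol
```

Trying the surrogate-pair alternative first in the same pattern means a high surrogate is never consumed on its own when a valid low surrogate follows it, which is the same ordering concern the two-step version addresses by running the pair substitution before the single-escape one.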