How can I reverse a string that contains combining characters in Perl?

Posted on 2024-08-03 09:58:27

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?

Comments (5)

海螺姑娘 2024-08-10 09:58:27

You can use the \X special escape (which matches a non-combining character together with all of the combining characters that follow it) with split to make a list of graphemes (with empty strings between them), reverse that list, then join the pieces back together:

#!/usr/bin/perl

use strict;
use warnings;

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;
print "original: $original\n",
      "wrong:    $wrong\n",
      "right:    $right\n";
毅然前行 2024-08-10 09:58:27

The best answer is to use Unicode::GCString, as Sinan points out.


I modified Chas's example a bit:

  • Set the encoding on STDOUT to avoid "Wide character in print" warnings;
  • Originally used a positive lookahead assertion (and no separator retention mode) in split; that apparently doesn't work after 5.10, so I removed it.

It's basically the same thing with a couple of tweaks.

use strict;
use warnings;

binmode STDOUT, ":utf8";

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;

print <<HERE;
original: [$original]
   wrong: [$wrong]
   right: [$right]
HERE
天邊彩虹 2024-08-10 09:58:27

You can use Unicode::GCString:

Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].

#!/usr/bin/env perl

use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);

use Unicode::GCString;

my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };

say "$x -> $wrong";
say "$y -> $correct";

Output:

résumé -> ́emuśer
résumé -> émusér
☆獨立☆ 2024-08-10 09:58:27

Perl6::Str->reverse also works.

In the case of the string résumé, you can also use the Unicode::Normalize core module to change the string to a fully composed form (NFC or NFKC) before reversing it; however, this is not a general solution, because some combinations of base character and modifier have no precomposed Unicode code point.
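
A small sketch of both halves of that point, assuming "q" followed by U+0301 as the example of a pair with no precomposed form (that choice of example is mine, not the answer's):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Works here: e + U+0301 composes to the single code point U+00E9.
my $resume   = NFC("re\x{0301}sume\x{0301}");
my $reversed = reverse $resume;      # "\x{e9}mus\x{e9}r"

# Does not generalise: q + U+0301 has no precomposed form, so NFC leaves
# two code points and reverse still detaches the accent from its base.
my $stubborn = NFC("aq\x{0301}");
my $broken   = reverse $stubborn;    # "\x{0301}qa"

binmode STDOUT, ':encoding(UTF-8)';
printf "%s -> %s (%d code points)\n", $resume,   $reversed, length $reversed;
printf "%s -> %s (%d code points)\n", $stubborn, $broken,   length $broken;
```

In the second case the accent ends up at the front of the string, ahead of the "q" it belonged to, which is the same failure the question describes.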

昨迟人 2024-08-10 09:58:27

Some of the other answers contain elements that don't work well. Here is a working example tested on Perl 5.12 and 5.14. Without the binmode call, printing the strings produces "Wide character in print" warnings. Using a positive lookahead assertion (and no separator retention mode) in split produces incorrect output on my MacBook.

#!/usr/bin/perl

use strict;
use warnings;
use feature 'unicode_strings';

binmode STDOUT, ":utf8";

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;
print "original: $original\n",
      "wrong:    $wrong\n",
      "right:    $right\n";