Playing with Unicode in Perl

Posted on 2024-12-17 10:03:02

I have a problem I thought to be trivial. I have to deal with umlauts from the German alphabet (äöü). In Unicode, there seem to be several ways to represent them, one of which uses combining characters. I need to normalise these different forms, replacing them all with the single-character code point.

Such a deviant umlaut is easily found: it is one of the letters a, o or u, followed by the UTF-8 byte sequence \xCC\x88 (the encoding of U+0308 COMBINING DIAERESIS). So I thought a regex would suffice.

This is my conversion function, employing the Encode module.

# This sub can be extended to include more conversions
sub convert {
    local $_;
    $_ = shift;

    $_ = encode( "utf-8", $_ );

    s/u\xcc\x88/ü/g;
    s/a\xcc\x88/ä/g;
    s/o\xcc\x88/ö/g;
    s/U\xcc\x88/Ü/g;
    s/A\xcc\x88/Ä/g;
    s/O\xcc\x88/Ö/g;

    return $_;
}

But the resulting printed umlaut is some even more devious character (now taking 4 bytes), instead of the one on this list.

I guess the problem is this juggling with Perl's internal format, actual UTF-8 and this Encoding format.

Even changing the substitution lines to

s/u\xcc\x88/\xc3\xbc/g;
s/a\xcc\x88/\xc3\xa4/g;
s/o\xcc\x88/\xc3\xb6/g;
s/U\xcc\x88/\xc3\x9c/g;
s/A\xcc\x88/\xc3\x84/g;
s/O\xcc\x88/\xc3\x96/g;

did not help: the umlauts are converted correctly, but are then followed by "\xC2\xA4" in the bytes.

Any help?

也只是曾经 2024-12-24 10:03:02

You're doing it wrong: you must stop the habit of messing with characters on the representation level, i.e. do not fiddle with bytes in regex when you deal with text, not binary data.

The first step is to learn about the topic of encoding in Perl. You need this to understand the term "character strings" I am going to use in the following paragraph.
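To make the distinction concrete, here is a minimal sketch of the usual pattern: decode bytes into a character string once at the input boundary, and encode once at the output boundary. It assumes the input really is UTF-8; the variable names are only illustrative. It also shows what happens when `encode` is called on data that is already UTF-8 bytes, which is a plausible cause of the extra bytes described in the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes as they might arrive from a file or socket:
# "u" followed by U+0308 COMBINING DIAERESIS (bytes 0xCC 0x88).
my $bytes = "u\xcc\x88";

# Decode once at the input boundary to get a character string.
my $chars = decode('UTF-8', $bytes);

# 3 bytes, but only 2 characters.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encode once at the output boundary to get bytes again.
my $out = encode('UTF-8', $chars);

# Encoding a string that already holds UTF-8 bytes encodes it a second
# time: each byte above 0x7F is treated as a character and becomes two bytes.
my $double = encode('UTF-8', $bytes);
printf "double-encoded length: %d\n", length($double);
```

Everything between the decode and the encode can then work on characters, not bytes.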

When you have a character string, it might be in any of the various states of (de)composition. Use the module Unicode::Normalize to change a character string, and read the relevant chapters on equivalence and normalisation in the Unicode specification for the gory details; they are linked at the bottom of that module's documentation.
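As an illustration of those states, the following sketch compares the precomposed and the decomposed spelling of "ä" before and after normalisation. The variable names are made up for the example; `NFC` and `NFD` are the functions exported by Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{e4}";     # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
my $decomposed  = "a\x{308}";   # "a" + U+0308 COMBINING DIAERESIS

# The two strings differ code point by code point ...
print "raw equal: ", ($precomposed eq $decomposed ? "yes" : "no"), "\n";

# ... but normalising both to the same form makes them identical.
print "NFC equal: ", (NFC($precomposed) eq NFC($decomposed) ? "yes" : "no"), "\n";
print "NFD equal: ", (NFD($precomposed) eq NFD($decomposed) ? "yes" : "no"), "\n";

# NFC yields one character, NFD yields two.
printf "NFC length: %d, NFD length: %d\n",
    length(NFC($decomposed)), length(NFD($precomposed));
```

Which form to pick depends on what the rest of the program expects; the point is that both inputs end up in the same form.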

I guess you want NFC, but you have to run a sanity check against your data to see whether that's really the intended result.

use charnames qw(:full);
use Unicode::Normalize qw(NFC);
my $original_character_string = "In des Waldes tiefsten Gr\N{LATIN SMALL LETTER U WITH DIAERESIS}nden ist kein R\N{LATIN SMALL LETTER A}\N{COMBINING DIAERESIS}uber mehr zu finden.";
my $modified_character_string = NFC($original_character_string);
# "In des Waldes tiefsten Gr\x{fc}nden ist kein R\x{e4}uber mehr zu finden."
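Applied to the function from the question, this could look roughly like the sketch below. It assumes the function receives UTF-8 encoded bytes and should return UTF-8 bytes; if the caller already passes a character string, the `decode` and `encode` steps should be dropped. The name `convert_bytes` is hypothetical, and the check for leftover combining marks is just one possible sanity check.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical replacement for the question's convert(): takes UTF-8
# bytes and returns UTF-8 bytes, composed wherever Unicode allows it.
sub convert_bytes {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> characters
    my $nfc   = NFC($chars);               # "a" + U+0308 becomes U+00E4, etc.

    # Sanity check: nonspacing marks can remain after NFC when no
    # precomposed character exists for a sequence.
    warn "combining marks remain after NFC\n" if $nfc =~ /\p{Mn}/;

    return encode('UTF-8', $nfc);          # characters -> bytes
}

# "Räuber" with a decomposed ä, given as UTF-8 bytes.
my $in  = "Ra\xcc\x88uber";
my $out = convert_bytes($in);

# $out now holds the bytes "R\xc3\xa4uber".
printf "%d bytes in, %d bytes out\n", length($in), length($out);
```

No regex per letter is needed, and the same function handles any other composable sequence, not only the six German umlauts.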
