Juggling Unicode with Perl
I have a problem I thought would be trivial. I have to deal with umlauts from the German alphabet (äöü). In Unicode, there seem to be several ways to represent them, one of which is combining characters. I need to normalise these different forms and replace them all with the one-character code.
Such a deviant umlaut is easily found: it is one of the letters a, o, or u followed by the combining diaeresis U+0308 (bytes \xCC\x88 in UTF-8). So I thought a regex would suffice.
This is my conversion function, employing the Encode package.
# This sub can be extended to include more conversions
sub convert {
    local $_;
    $_ = shift;
    $_ = encode( "utf-8", $_ );
    s/u\xcc\x88/ü/g;
    s/a\xcc\x88/ä/g;
    s/o\xcc\x88/ö/g;
    s/U\xcc\x88/Ü/g;
    s/A\xcc\x88/Ä/g;
    s/O\xcc\x88/Ö/g;
    return $_;
}
But the resulting printed umlaut is some even more devious character (now taking 4 bytes), instead of the one on this list.
I guess the problem is this juggling with Perl's internal format, actual UTF-8 and this Encoding format.
Even changing the substitution lines to
s/u\xcc\x88/\xc3\xbc/g;
s/a\xcc\x88/\xc3\xa4/g;
s/o\xcc\x88/\xc3\xb6/g;
s/U\xcc\x88/\xc3\x9c/g;
s/A\xcc\x88/\xc3\x84/g;
s/O\xcc\x88/\xc3\x96/g;
did not help: the characters are converted correctly, but are then followed by "\xC2\xA4" in the bytes.
Any help?
You're doing it wrong: you must stop the habit of messing with characters on the representation level, i.e. do not fiddle with bytes in regex when you deal with text, not binary data.
The first step is to learn about the topic of encoding in Perl. You need this to understand the term "character strings" I am going to use in the following paragraph.
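To make the distinction concrete, here is a minimal sketch using the core Encode module. The sample input is an assumption for illustration; it shows the same data first as raw UTF-8 bytes and then as a decoded character string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: raw UTF-8 bytes for "u" followed by
# U+0308 COMBINING DIAERESIS, as they might arrive from a file.
my $bytes = "u\xcc\x88";

# decode() turns the bytes into a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes:      %d\n", length $bytes;   # counts bytes
printf "characters: %d\n", length $chars;   # counts code points

# In a character string the combining mark is one code point,
# so it can be matched as a character rather than as two bytes.
print "has U+0308\n" if $chars =~ /\x{0308}/;

# encode() turns the character string back into bytes for output.
my $out = encode('UTF-8', $chars);
print "round trip ok\n" if $out eq $bytes;
```

The byte string has length 3 and the character string has length 2, which is why byte-level patterns such as \xcc\x88 do not behave as expected once the data is treated as text.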
When you have a character string, it might be in any of the various states of (de)composition. Use the module Unicode::Normalize to change a character string, and read the relevant chapters on equivalence and normalisation in the Unicode specification for the gory details; they are linked at the bottom of that module's documentation.
I guess you want NFC, but you have to run a sanity check against your data to see whether that's really the intended result.
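As a rough sketch of what this could look like, assuming the input arrives as UTF-8 bytes and the output should be UTF-8 bytes again, the regex substitutions can be replaced by one call to NFC from Unicode::Normalize. The convert name is reused from the question and the sample string is made up for illustration; whether decoding and encoding belong inside the sub depends on where your program does its I/O.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical replacement for the convert() sub in the question.
# Assumption: takes UTF-8 bytes and returns UTF-8 bytes.
sub convert {
    my ($bytes) = @_;
    my $chars    = decode('UTF-8', $bytes);   # bytes -> character string
    my $composed = NFC($chars);               # compose, e.g. u + U+0308 -> U+00FC
    return encode('UTF-8', $composed);        # character string -> bytes
}

# Made-up sample: "u" followed by the combining diaeresis.
my $in  = "Mu\xcc\x88ller";
my $out = convert($in);

print "composed\n" if $out eq "M\xc3\xbcller";
```

NFC composes every sequence that has a precomposed form, not only the six German umlauts, so this is where the sanity check against real data matters.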