Character encoding is messing up Perl regular expressions
Short version: here is a minimal failing example:
$> echo xóx > /tmp/input
$> hex /tmp/input
0x00000000: 78 c3 b3 78 0a
$> perl -e 'open F, "<", "/tmp/input" or die $!;
while(<F>) {
if ($_=~/x(\w)x/) {
print "Match:$1\n";
}else{
print "No match\n";
}
}'
No match
Why does this fail, and how can I make the Perl script accept ó with \w?
Long version: I am scraping data from HTML using Perl (5.10). The end goal is to have strings represented exclusively by the ASCII printable set (0x20-0x7F). This will involve changing e.g. ó to &oacute; and also mapping certain characters to approximations, e.g. various spaces end up as 0x20 and a certain kind of apostrophe (see later) should end up as plain old 0x27.
My quest began when "ó"=~/\W/ returned true, which surprised me because perldoc perlretut tells me
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
I figure it's something to do with the character encoding. I don't know a great deal about this, but the source HTML contains
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
and a hexdump tells me that ó is encoded as b3c3 and not f3 as I had first expected.
In Perl, I tried to fix this with open F, "<:encoding(UTF-8)", $f but this gives me errors such as
utf8 "\xF3" does not map to Unicode
and strings like \xF3 appear in the output from read. It got weirder when I noticed that some characters are encoded out of order, which I don't understand at all. Here are two hexdumps (UNIX hexdump utility) for comparison:
Ralt => 61 52 74 6c
Réalt => c3 52 61 a9 74 6c
WTF?
Also, here's that damned apostrophe that I mentioned earlier.
Pats => 61 50 73 74
Pat’s => 61 50 e2 74 99 80
Here are my questions:
- What's with the crazy out-of-order encoding?
- Can I configure Perl to accept the above strings in regexes such as s/ó/&oacute;/g ?
- What can I do to transform e.g. Pat’s into Pat's and basically get it all into ASCII, with HTML entities for the usual accented vowels?
For part 2 I can confirm that my keyboard enters ó into the text editor using the same encoding as the files which are read in.
For part 3 it is not at all necessary to stay within Perl. I also only need mappings for common punctuation like apostrophes. Any exotic characters with no obvious ASCII equivalents are unexpected and should simply trigger failure.
Comments (2)
Your hexdumper sucks. Use a proper one.
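The "out-of-order" bytes in the question are most likely an artifact of the dump format rather than of the file: hexdump without options prints 16-bit words in host byte order, so on a little-endian machine each pair of bytes appears swapped (hexdump -C prints them byte by byte). A minimal Perl sketch that shows the bytes in their real order follows; hex_bytes is a hypothetical helper name, and the strings are written with escapes so the encoding of the script file does not matter.

```perl
use strict;
use warnings;

# hex_bytes is a hypothetical helper: it returns the bytes of a string
# in file order, one two-digit hex value per byte.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Byte strings spelled out with \x escapes.
my $realt = "R\xc3\xa9alt";        # UTF-8 bytes of "Réalt"
my $pats  = "Pat\xe2\x80\x99s";    # UTF-8 bytes of "Pat’s"

print hex_bytes($realt), "\n";     # 52 c3 a9 61 6c 74
print hex_bytes($pats),  "\n";     # 50 61 74 e2 80 99 73
```

Swapping each pair in 52 c3 a9 61 6c 74 gives c3 52 61 a9 74 6c, which is exactly the dump shown in the question.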
Yes, the configuration is use utf8; so that a literal ó in the Perl source code is treated as a character. s/ó/&oacute;/g works just fine, but you should use a module to deal with entities.
Read http://p3rl.org/UNI to learn about the topic of encoding in Perl.
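As a rough illustration of what use utf8; changes, the sketch below assumes the script file itself is saved as UTF-8. The variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

my $word = 'Réalt';

# With "use utf8" the literal é is a single character, so the string
# has five characters rather than six bytes.
my $length = length $word;

# A literal ó in the pattern now stands for the character U+00F3.
(my $entity_form = 'xóx') =~ s/ó/&oacute;/g;

print "$length\n";         # 5
print "$entity_form\n";    # x&oacute;x
```

Without the pragma, the same literals would be read as the individual bytes of their UTF-8 encoding, and length would report 6.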
You take that string of bytes (the UTF-8 encoding of "xóx"), and you pass it to the regex engine, which expects a string of Unicode code points. The UTF-8 encoding of "xóx" is 78 C3 B3 78 0A, which is "xÃ³x" when treated as Unicode code points.
You actually want to pass 78 F3 78 0A to the regex engine, and that can be obtained through a process called "decoding".
For your one-liner in a UTF-8 environment, you could use -CS. For a script, you could use binmode, perhaps via use open.
Always decode your inputs. Always encode your outputs.
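The decode-then-match-then-encode cycle can be sketched with the core Encode module. This is only an illustration on the five bytes from the question, not the answerer's original code; the variable names are assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The five bytes from the question: the UTF-8 encoding of "xóx\n".
my $bytes = "x\xc3\xb3x\n";

# Undecoded, the regex engine sees x, U+00C3, U+00B3, x, so there is
# no single character between the two x's and x(\w)x cannot match.
my $raw_matches = $bytes =~ /x(\w)x/ ? 1 : 0;

# Decoding turns the bytes into the characters x, U+00F3, x, newline.
my $chars = decode('UTF-8', $bytes);
my ($middle) = $chars =~ /x(\w)x/;

# Encode again before the characters leave the program.
print encode('UTF-8', "Match:$middle\n");
```

In a script that reads files, the same decoding can be attached to the handle instead, for example with open my $fh, '<:encoding(UTF-8)', $path, or for every handle in scope with use open qw(:std :encoding(UTF-8));.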
As for your other question, you can use HTML::Entities to convert the text into HTML entities (once you've decoded it).
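One possible shape for part 3 of the question is sketched below, assuming HTML::Entities is installed from CPAN. The helper name to_ascii, the small punctuation table, and the rule that only ASCII plus U+00C0..U+00FF is acceptable are all assumptions made for illustration; adjust them to the characters you actually expect.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# to_ascii is a hypothetical helper; it expects decoded characters,
# not raw bytes.
sub to_ascii {
    my ($text) = @_;

    # Assumed punctuation table: curly single quotes become the plain
    # apostrophe 0x27, the no-break space becomes 0x20.
    $text =~ tr/\x{2018}\x{2019}\x{a0}/'' /;

    # Assumed policy: only ASCII and the Latin-1 letters U+00C0..U+00FF
    # are expected; anything else triggers failure.
    if ($text =~ /([^\x00-\x7f\x{c0}-\x{ff}])/) {
        die sprintf "unexpected character U+%04X\n", ord $1;
    }

    # Replace only the non-ASCII characters with named entities.
    return encode_entities($text, '^\x00-\x7f');
}

print to_ascii("Pat\x{2019}s"), "\n";    # Pat's
print to_ascii("R\x{e9}alt"), "\n";      # R&eacute;alt
```

The second argument to encode_entities restricts encoding to characters outside ASCII, so the apostrophe produced by the mapping step is left alone.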
Note that it's kinda silly to encode characters other than «&», «<», «>», «"» and «'» (and not even all of those are needed) since you use