Character encoding messes up Perl regex
Short version: here is a minimal failing example:
$> echo xóx > /tmp/input
$> hex /tmp/input
0x00000000: 78 c3 b3 78 0a
$> perl -e 'open F, "<", "/tmp/input" or die $!;
while(<F>) {
    if ($_=~/x(\w)x/) {
        print "Match:$1\n";
    }else{
        print "No match\n";
    }
}'
No match
Why does this fail, and how can I make the Perl script accept ó with \w?
Long version: I am scraping data from HTML using Perl (5.10). The end goal is to have strings represented exclusively by the ASCII printable set (0x20-0x7F). This will involve changing e.g. ó to &oacute; and also mapping certain characters to approximations, e.g. various spaces end up as 0x20 and a certain kind of apostrophe (see later) should end up as plain old 0x27.
My quest began when "ó"=~/\W/ returned true, which surprised me because perldoc perlretut tells me
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
I figure it's something to do with the character encoding. I don't know a great deal about this, but the source HTML contains
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
and a hexdump tells me that ó is encoded as b3c3 and not f3 as I had first expected.
In Perl, I tried to fix this with open F, "<:encoding(UTF-8)", $f but this gives me errors such as
utf8 "\xF3" does not map to Unicode
and strings like \xF3 appear in the output from read. It got weirder when I noticed that some characters are encoded out-of-order, which I don't understand at all. Here are two hexdumps (UNIX hexdump utility) for comparison:
Ralt => 61 52 74 6c
Réalt => c3 52 61 a9 74 6c
WTF?
Also, here's that damned apostrophe that I mentioned earlier.
Pats => 61 50 73 74
Pat’s => 61 50 e2 74 99 80
Here are my questions:
- What's with the crazy out-of-order encoding?
- Can I configure Perl to accept the above strings in regexes such as s/ó/&oacute;/g ?
- What can I do to transform e.g. Pat’s into Pat's and basically get it all into ASCII, with HTML entities for the usual accented vowels?
For part 2 I can confirm that my keyboard enters ó into the text editor using the same encoding as the files which are read in.
For part 3 it is not at all necessary to stay within Perl. I also only need mappings for common punctuation like apostrophes. Any exotic characters with no obvious ASCII equivalents are unexpected and should simply trigger failure.
Comments (2)
Your hexdumper sucks. Use a proper one.
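For what it's worth, the "out-of-order" bytes in the question look like what you get when a dump is rendered as 16-bit little-endian words instead of single bytes, which is my reading of the default hexdump output on a little-endian machine. A minimal sketch of the two views follows; the helper names hex_bytes and hex_words_le are made up for illustration.

```perl
use strict;
use warnings;

# Render bytes one at a time, in file order.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# Render the same bytes as 16-bit little-endian words,
# which makes each pair of bytes appear swapped.
sub hex_words_le {
    my ($bytes) = @_;
    $bytes .= "\0" if length($bytes) % 2;    # pad odd-length input
    return join ' ', map { sprintf '%04x', $_ } unpack 'v*', $bytes;
}

my $bytes = "R\xC3\xA9alt";          # the UTF-8 bytes of "Réalt"
print hex_bytes($bytes), "\n";       # 52 c3 a9 61 6c 74
print hex_words_le($bytes), "\n";    # c352 61a9 746c
```

The second line reproduces the c3 52 61 a9 74 6c sequence from the question, so the file itself is probably in perfectly ordinary byte order. A byte-wise dump such as hexdump -C avoids the confusion.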
Yes, the configuration is use utf8; so that a literal ó in the Perl source code is treated as a character. s/ó/&oacute;/g works just fine, but you should use a module to deal with entities as below.
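A small sketch of that use utf8 point, assuming the script file itself is saved as UTF-8 and the input has already been decoded; the sub name escape_o_acute is only for illustration.

```perl
use strict;
use warnings;
use utf8;                   # literals in this source file are decoded as UTF-8
use Encode qw(decode);

# Replace the single character ó in decoded text with its HTML entity.
sub escape_o_acute {
    my ($text) = @_;
    $text =~ s/ó/&oacute;/g;
    return $text;
}

my $bytes = "x\xC3\xB3x";              # raw UTF-8 bytes, as read from a file
my $text  = decode('UTF-8', $bytes);   # three characters: x, U+00F3, x
print escape_o_acute($text), "\n";     # x&oacute;x
```

Without use utf8, the ó in the pattern would be two separate characters (C3 and B3), and it would not match the single decoded character U+00F3.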
Read http://p3rl.org/UNI to learn about the topic of encoding in Perl.
You take that string of bytes (the UTF-8 encoding of "xóx") and pass it to the regex engine, which expects a string of Unicode code points. The UTF-8 encoding of "xóx" is 78 C3 B3 78 0A, which is "xÃ³x" when treated as Unicode code points. You actually want to pass 78 F3 78 0A to the regex engine, and that can be obtained through a process called "decoding".
For your one-liner in a UTF-8 environment, you could use -CS. For a script, you could use binmode, perhaps via use open.
Always decode your inputs. Always encode your outputs.
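To make the decode/encode rule concrete, here is a minimal sketch. It calls Encode::decode explicitly so it runs without an input file; binmode or use open would apply the same decoding at the filehandle level. The sub name middle_word_char is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a line of UTF-8 bytes and return the character captured by
# x(\w)x, or undef when the line does not match.
sub middle_word_char {
    my ($bytes) = @_;
    my $line = decode('UTF-8', $bytes);    # bytes -> Unicode code points
    return $line =~ /x(\w)x/ ? $1 : undef;
}

my $raw = "x\xC3\xB3x\n";                  # the five bytes from /tmp/input
my $ch  = middle_word_char($raw);

binmode STDOUT, ':encoding(UTF-8)';        # encode on the way out
print defined $ch ? "Match:$ch\n" : "No match\n";
```

The undecoded string fails for a simple reason: there are two characters between the x's, so x(\w)x cannot match whatever \w accepts.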
As for your other question, you can use HTML::Entities to convert the text into HTML entities (once you've decoded it).
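As a rough sketch of the whole pipeline from part 3 of the question: the code below hand-rolls two deliberately tiny, hypothetical tables so that it runs with core Perl only and dies on anything unexpected. In a real script, encode_entities from HTML::Entities would replace the entity table; the sub name to_ascii_html and both tables are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical punctuation approximations; extend as needed.
my %ascii_for = (
    "\x{2018}" => "'",    # left single quotation mark
    "\x{2019}" => "'",    # right single quotation mark, as in Pat’s
    "\x{00A0}" => ' ',    # no-break space
);

# Hypothetical entity table for the usual accented vowels.
my %entity_for = (
    "\x{E1}" => '&aacute;', "\x{E9}" => '&eacute;', "\x{ED}" => '&iacute;',
    "\x{F3}" => '&oacute;', "\x{FA}" => '&uacute;',
);

# Decode UTF-8 bytes, then map every non-ASCII character or die.
sub to_ascii_html {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);
    $text =~ s{([^\x00-\x7F])}{
        $ascii_for{$1} // $entity_for{$1}
            // die sprintf("unexpected character U+%04X\n", ord $1)
    }gex;
    return $text;
}

print to_ascii_html("Pat\xE2\x80\x99s"), "\n";    # Pat's
print to_ascii_html("R\xC3\xA9alt"), "\n";        # R&eacute;alt
```

Decoding with FB_CROAK also makes malformed input fail loudly, which seems to fit the "unexpected characters should trigger failure" requirement.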
Note that it's kinda silly to encode characters other than «&», «<», «>», «"» and «'» (and not even all of those are needed) since you use