Character encoding messes up Perl regex

Posted on 2025-01-06 01:26:23

Short version: here is a minimal failing example:

$> echo xóx > /tmp/input
$> hex /tmp/input
0x00000000: 78 c3 b3 78 0a
$> perl -e 'open F, "<", "/tmp/input" or die $!;
       while(<F>) {
           if ($_=~/x(\w)x/) {
               print "Match:$1\n";
           }else{
               print "No match\n";
           }
       }'
No match

Why does this fail and how can I make the Perl script accept ó with \w?


Long version: I am scraping data from HTML using Perl (5.10). The end goal is to have strings represented exclusively by the ASCII printable set (0x20-0x7E). This will involve changing e.g. ó to &oacute; and also mapping certain characters to approximations, e.g. various spaces end up as 0x20 and a certain kind of apostrophe (see later) should end up as plain old 0x27.

My quest began when "ó"=~/\W/ returned true, which surprised me because perldoc perlretut tells me

\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts

I figure it's something to do with the character encoding. I don't know a great deal about this, but the source HTML contains

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

and a hexdump tells me that ó is encoded as b3c3 and not f3 as I had first expected.

In Perl, I tried to fix this with open F, "<:encoding(UTF-8)", $f but this gives me errors such as

utf8 "\xF3" does not map to Unicode

and strings like \xF3 appear in the output from read. It got weirder when I noticed that some characters appear to be encoded out of order, which I don't understand at all. Here are two hexdumps (UNIX hexdump utility) for comparison:

Ralt => 61 52 74 6c

Réalt => c3 52 61 a9 74 6c

WTF?

Also, here's that damned apostrophe that I mentioned earlier.

Pats => 61 50 73 74

Pat’s => 61 50 e2 74 99 80

Here are my questions:

  1. What's with the crazy out-of-order encoding?
  2. Can I configure Perl to accept the above strings in regexes such as s/ó/&oacute;/g ?
  3. What can I do to transform e.g. Pat’s into Pat's and basically get it all into ASCII, with HTML entities for the usual accented vowels?

For part 2 I can confirm that my keyboard enters ó into the text editor using the same encoding as the files which are read in.

For part 3 it is not at all necessary to stay within Perl. I also only need mappings for common punctuation like apostrophes. Any exotic characters with no obvious ASCII equivalents are unexpected and should simply trigger failure.


Comments (2)

把梦留给海 2025-01-13 01:26:24
  1. Your hexdumper sucks. Use a proper one.

    $ echo -n Réalt | hex
    0000  52 c3 a9 61 6c 74                                 R..alt
    $ echo -n Pat’s | hex
    0000  50 61 74 e2 80 99 73                              Pat...s
    
  2. Yes, the configuration is use utf8;, so that a literal ó in the Perl source code is treated as a character. s/ó/&oacute;/g works just fine, but you should use a module to deal with entities as below.

3.

    use utf8;
    use HTML::Entities qw(encode_entities);

    encode_entities 'Réalt';    # returns 'R&eacute;alt'
    encode_entities 'Pat’s';    # returns 'Pat&rsquo;s'

Read http://p3rl.org/UNI to learn about the topic of encoding in Perl.
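
To illustrate point 1, here is a small sketch (not part of the original answer) of why the default hexdump output looks scrambled: with no options it prints 16-bit little-endian words, so each pair of bytes appears swapped. The helper names bytes_in_order and le_words are made up for this example. On the command line, hexdump -C or od -An -tx1 print one byte at a time instead.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes of "Réalt", written as escapes so the encoding
# of this source file does not matter.
my $bytes = "R\x{C3}\x{A9}alt";

# One byte at a time, in file order.
sub bytes_in_order {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $s;
}

# 16-bit little-endian words: each pair of bytes is shown swapped.
sub le_words {
    my ($s) = @_;
    $s .= "\0" if length($s) % 2;    # pad odd-length input to a whole word
    return join ' ', map { sprintf '%04x', $_ } unpack 'v*', $s;
}

print bytes_in_order($bytes), "\n";    # 52 c3 a9 61 6c 74
print le_words($bytes), "\n";          # c352 61a9 746c
```

The second line reproduces the c3 52 61 a9 74 6c from the question: the bytes in the file are in the right order, only the display swaps them.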

琴流音 2025-01-13 01:26:24

You take that string of bytes (the UTF-8 encoding of "xóx"), and you pass it to the regex engine which expects a string of Unicode code points. The UTF-8 encoding of "xóx" is 78 C3 B3 78 0A, which is "xÃ³x" when treated as Unicode code points.

You actually want to pass 78 F3 78 0A to the regex engine, and that can be obtained through a process called "decoding".
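
As a minimal sketch of that difference (using the core Encode module; the variable names are only for illustration), the same pattern can be run against the raw octets and against the decoded string:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "x\x{C3}\x{B3}x\n";         # the 5 bytes stored in /tmp/input
my $chars  = decode('UTF-8', $octets);   # 4 characters: x, U+00F3, x, newline

# Two bytes sit between the x's in the octet string, so a single \w cannot match.
printf "octets: length %d, %s\n", length $octets,
    ($octets =~ /x(\w)x/ ? 'match' : 'no match');

# After decoding there is exactly one character (ó) between the x's.
printf "chars:  length %d, %s\n", length $chars,
    ($chars =~ /x(\w)x/ ? 'match' : 'no match');
```

This prints "no match" for the octets and "match" for the decoded characters.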

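For your one-liner in a UTF-8 environment, you could use -C (here -CSDA: S covers STDIN, STDOUT and STDERR, D sets the default layer for opened files, and A decodes @ARGV):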

perl -CSDA -ne'
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
' /tmp/input

For a script, you could use binmode, perhaps via use open:

use utf8;                             # Source code is UTF-8
use open ':std', ':encoding(UTF-8)';  # Set encoding for STD*
use open IO => ':encoding(UTF-8)';    # Default encoding for files

while (<>) {
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
}

Always decode your inputs. Always encode your outputs.
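
The "\xF3" does not map to Unicode error in the question suggests that some of the scraped files are not valid UTF-8. Assuming (and this is only a guess) that those files are ISO-8859-1, a sketch of decoding by hand with a fallback could look like this; decode_input is a hypothetical helper, not something from a module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: try strict UTF-8 first, and fall back to
# ISO-8859-1 (an assumption about the source) if the bytes are malformed.
sub decode_input {
    my ($octets) = @_;
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : decode('ISO-8859-1', $octets);
}

# The same text, once as UTF-8 bytes and once as Latin-1 bytes.
for my $octets ("x\x{C3}\x{B3}x", "x\x{F3}x") {
    my $text = decode_input($octets);                            # decode the input
    print encode('UTF-8', "Match:$1\n") if $text =~ /x(\w)x/;    # encode the output
}
```

Guessing encodings is fragile, so if the real encoding of each file is known it is better to name it explicitly.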


As for your other question, you can use HTML::Entities to convert the text into HTML entities (once you've decoded it).

Note that it's kinda silly to encode characters other than «&», «<», «>», «"» and «'» (and not even all of those are needed) since you use

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
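
If ASCII-only output is still required, as the question asks, here is a rough sketch of part 3 using only core modules. The two mapping tables and the to_ascii name are made up for illustration and would need to be extended; anything that is not in either table triggers a failure, as requested.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative tables only: punctuation mapped to ASCII approximations,
# accented vowels mapped to HTML entities.
my %punct = (
    "\x{2018}" => "'",    # left single quotation mark
    "\x{2019}" => "'",    # right single quotation mark
    "\x{00A0}" => ' ',    # no-break space
    "\x{2009}" => ' ',    # thin space
);
my %entity = (
    "\x{E1}" => '&aacute;', "\x{E9}" => '&eacute;', "\x{ED}" => '&iacute;',
    "\x{F3}" => '&oacute;', "\x{FA}" => '&uacute;',
);

sub to_ascii {
    my ($octets) = @_;
    # Decode first; malformed UTF-8 is a fatal error here.
    my $text = decode('UTF-8', $octets, Encode::FB_CROAK);
    # Replace everything outside printable ASCII (newline is left alone).
    $text =~ s{([^\x20-\x7E\n])}{
        $punct{$1} // $entity{$1}
            // die sprintf("unexpected character U+%04X\n", ord $1)
    }ge;
    return $text;
}

print to_ascii("Pat\x{E2}\x{80}\x{99}s R\x{C3}\x{A9}alt\n");   # Pat's R&eacute;alt
```

For the entity half alone, HTML::Entities already knows the full table, so a hand-written map like this is mainly useful for the punctuation approximations.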