Filtering Microsoft 1252 characters out of ASCII text files opened in utf8 mode in Perl

Posted on 2024-12-11 02:28:58

I have a reasonably sized flat file database of text documents, mostly saved in 8859 format, which have been collected through a web form (using Perl scripts). Up until recently I was handling the common 1252 characters (curly quotes, apostrophes etc.) with a simple set of regexes:

$line=~s/\x91/\&\#8216\;/g; # smart apostrophe left
$line=~s/\x92/\&\#8217\;/g; # smart apostrophe right

... etc.

However, since I decided I ought to be going Unicode and converted all my scripts to read and write utf8 (which works a treat for all new material), the regexes for these (existing) 1252 characters no longer work, and my Perl HTML output literally contains the 4 characters '\x92', '\x93', etc. At least, that's how it appears in a browser in utf8 mode. If I download the output (ftp, not http) and open it in a text editor (TextPad), it's different: a single undefined character remains. Opening the output file in Firefox's default 8859 mode (no content type header) renders the correct character.

The new utf8 pragmas at the start of the script are:

use CGI qw(-utf8);
use open IO => ':utf8';

I understand this is due to utf8 mode making the characters double byte instead of single byte, and that it applies to those chars in the 0x80 to 0xff range. I read the article on Wikibooks relating to this, but I was none the wiser as to how to filter them. Ideally I know I ought to resave all the documents in utf8 (since the flat file database now contains a mixture of 8859 and utf8), but I will need some kind of filter in the first place if I'm going to do that.

I could be wrong about the 2-byte internal storage, since the article did seem to imply that Perl handles things very differently according to circumstances.

If anybody could provide me with a regex solution, or some other method, I would be very grateful. I have been tearing my hair out for weeks on this with various failed attempts at hacking. There are only about 6 1252 characters that commonly need replacing, and with a filter method I could resave the whole flippin' lot in utf8 and forget there ever was a 1252...


若沐 2024-12-18 02:28:58

Encoding::FixLatin was written specifically to help fix data that is broken in the same way as yours.
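As a minimal sketch of how the module might be called, assuming Encoding::FixLatin is installed from CPAN (it is not part of core Perl) and that the data is read as raw bytes; the sample string is illustrative only:

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);   # CPAN module, not in core Perl

# Raw bytes mixing CP1252 curly quotes (0x91, 0x92) with a UTF-8 smiley.
my $octets = "\x91 Foo \xE2\x98\xBA \x92";

# By default fix_latin returns a Perl character string.
my $string = fix_latin($octets);

# Show the resulting code points so the result can be checked by eye.
my @codepoints = map { sprintf 'U+%04X', ord } split //, $string;
print "@codepoints\n";
```

The stray 0x91 and 0x92 bytes should come out as U+2018 and U+2019, while the already valid UTF-8 sequence is passed through unchanged.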


抱着落日 2024-12-18 02:28:58

Ikegami already mentioned the Encoding::FixLatin module.

Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:

unless ( utf8::decode($string) ) {
    require Encode;
    $string = Encode::decode(cp1252 => $string);
}

Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.
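To make the trade-off concrete, here is a rough sketch of the same idea applied line by line, which may suit a file set where some lines are UTF-8 and others CP1252. The helper name `decode_utf8_or_cp1252` and the sample lines are made up for illustration; in practice the lines would come from a filehandle opened in `:raw` mode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: takes raw bytes, returns a character string.
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;
    my $copy = $bytes;                        # utf8::decode works in place
    return $copy if utf8::decode($copy);      # whole string was valid UTF-8
    return Encode::decode('cp1252', $bytes);  # otherwise assume CP1252
}

# Sample input standing in for lines read from a file in :raw mode.
my @raw_lines = (
    "It\x92s CP1252\n",          # a lone 0x92 byte is not valid UTF-8
    "It\xE2\x80\x99s UTF-8\n",   # U+2019 already encoded as UTF-8
);

# Encode on output so both lines are written as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print decode_utf8_or_cp1252($_) for @raw_lines;
```

Both sample lines end up with the same right single quotation mark (U+2019). Pure ASCII lines are valid UTF-8, so they pass through the first branch untouched.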


巾帼英雄 2024-12-18 02:28:58

You could also use Encode.pm's support for fallback.

use Encode qw[decode];

my $octets = "\x91 Foo \xE2\x98\xBA \x92";
my $string = decode('UTF-8', $octets, sub {
    my ($ordinal) = @_;
    return decode('Windows-1252', pack 'C', $ordinal);
});

printf "<%s>\n", 
  join ' ', map { sprintf 'U+%.4X', ord $_ } split //, $string;

Output:

<U+2018 U+0020 U+0046 U+006F U+006F U+0020 U+263A U+0020 U+2019>


手长情犹 2024-12-18 02:28:58

Did you recode the data files? If not, opening them as UTF-8 won't work. You can simply open them as

open $filehandle, '<:encoding(cp1252)', $filename or die ...;

and everything (tm) should work.

If you did recode, something seems to have gone wrong, and you need to analyze what it is and fix it. I recommend using hexdump to find out what is actually in a file. Text consoles and editors sometimes lie to you; hexdump never lies.
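For files that are known to be entirely CP1252, a one-off re-encoding pass could look roughly like the sketch below. It assumes a Unix-like system; the temporary sample file and the `.utf8` output name are invented for the example, and `unpack 'H*'` is used as a stand-in for running hexdump on the result.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small CP1252 sample file so the sketch is self-contained.
my ($tmp, $in_name) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\x93quoted\x94\n";      # CP1252 curly double quotes
close $tmp or die "close $in_name: $!";

my $out_name = "$in_name.utf8";       # hypothetical output file name

# Decode CP1252 on the way in, encode UTF-8 on the way out.
open my $in,  '<:encoding(cp1252)', $in_name  or die "open $in_name: $!";
open my $out, '>:encoding(UTF-8)',  $out_name or die "open $out_name: $!";
while (my $line = <$in>) {
    print {$out} $line;
}
close $in;
close $out or die "close $out_name: $!";

# Inspect the bytes actually written, in the spirit of hexdump.
open my $raw, '<:raw', $out_name or die "open $out_name: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
unlink $out_name;

my $hex = unpack 'H*', $bytes;
print "$hex\n";
```

If the conversion worked, each curly quote shows up as a three-byte sequence starting with `e280` rather than as a single `93` or `94` byte.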
