Force-converting a mixed ISO-8859-1 and UTF-8 multi-line string to UTF-8 in Perl
Consider the following problem:
A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.
I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1 lines. Also, in the event of errors in the processing I want to provide a "best effort result" rather than throwing an error.
My current attempt looks like this:
$junk = force_utf8($junk);
sub force_utf8 {
    my $input = shift;
    my $output = '';
    foreach my $line (split(/\n/, $input)) {
        if (utf8::valid($line)) {
            utf8::decode($line);
        }
        $output .= "$line\n";
    }
    return $output;
}
Obviously the conversion will never be perfect since we're lacking information about the original encoding of each line. But is this the "best effort result" we can get?
How would you improve the heuristics/functionality of the force_utf8(...) sub?
5 Answers
I have no useful advice to offer except that I would have tried using Encode::Guess first.
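A minimal sketch of that first attempt (the sample byte strings are mine, not from the answer). One documented caveat applies directly here: single-byte encodings like ISO-8859-1 accept every byte sequence, so Encode::Guess reports ambiguity whenever a line is also well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode::Guess;

# Add ISO-8859-1 to the default suspect list (ascii, utf8, BOM-marked UTF-16/32).
Encode::Guess->set_suspects('iso-8859-1');

for my $bytes ("caf\xE9", "caf\xC3\xA9") {
    my $decoder = Encode::Guess->guess($bytes);
    if (ref $decoder) {
        # Exactly one suspect matched: guess() returned an Encode::Encoding object.
        printf "%s => %s\n", unpack('H*', $bytes), $decoder->name;
    } else {
        # Several suspects matched: guess() returned an error string instead.
        printf "%s => ambiguous (%s)\n", unpack('H*', $bytes), $decoder;
    }
}
```

For "caf\xE9" only ISO-8859-1 matches, so a decoder comes back; "caf\xC3\xA9" is valid in both encodings and is reported as ambiguous, which is exactly the hard case in the question.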
You might be able to fix it up using a bit of domain knowledge. For example, "Ã©" is not a likely character combination in ISO-8859-1; it is much more likely to be the UTF-8 encoding of "é" read as ISO-8859-1.
If your input is limited to a restricted pool of characters, you can also use a heuristic such as assuming "Ã" will never occur in your input stream.
Without this kind of domain knowledge, your problem is in general intractable.
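That assumption can be sketched directly on the raw bytes (the helper name and sample strings are mine): in ISO-8859-1 the bytes 0xC2/0xC3 display as "Â"/"Ã" and rarely appear in real text, whereas in UTF-8 they are exactly the lead bytes for the accented Latin-1 range U+0080-U+00FF.

```perl
use strict;
use warnings;

# Hypothetical helper: flag lines that are probably UTF-8 because they
# contain a 0xC2/0xC3 lead byte followed by a continuation byte --
# the UTF-8 form of the accented Latin-1 characters (U+0080..U+00FF).
sub probably_utf8 {
    my $bytes = shift;
    return $bytes =~ /[\xC2\xC3][\x80-\xBF]/ ? 1 : 0;
}
```

This only covers the Latin-1 accented range, which is the domain knowledge being assumed; a full well-formedness check is more general.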
Just by looking at a character it will be hard to tell whether it is ISO-8859-1 or UTF-8 encoded. The problem is that both are 8-bit encodings, so simply looking at the MSb is not sufficient. For every line, then, I would transcode the line assuming it is UTF-8. When an invalid UTF-8 encoding is found, re-transcode the line assuming that it is really ISO-8859-1. The problem with this heuristic is that you might transcode ISO-8859-1 lines that are also well-formed UTF-8 lines; however, without external information about $junk there is no way to tell which is appropriate.
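A sketch of that per-line heuristic using the core Encode module (the sample data in the comments is mine). A strict UTF-8 decode is attempted first, and ISO-8859-1, which accepts any byte sequence, is the fallback:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sketch of the "try UTF-8, fall back to ISO-8859-1" heuristic. Note that
# utf8::valid() from the original attempt checks Perl's internal string
# flag, not whether the bytes are well-formed UTF-8, so Encode is used here.
sub force_utf8 {
    my $input = shift;
    my $output = '';
    foreach my $line (split(/\n/, $input)) {
        my $decoded = eval {
            # FB_CROAK: die on malformed UTF-8; LEAVE_SRC: leave $line untouched.
            decode('UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        # Every byte string is valid ISO-8859-1, so this fallback cannot fail.
        $decoded = decode('ISO-8859-1', $line) unless defined $decoded;
        $output .= encode('UTF-8', $decoded) . "\n";
    }
    return $output;
}
```

With input "caf\xC3\xA9\ncaf\xE9\n" (one UTF-8 line, one Latin-1 line) both lines come out as UTF-8 "café"; as the answer notes, a Latin-1 line that happens to be well-formed UTF-8 will still be misread.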
Take a look at this article. UTF-8 is optimised to represent Western language characters in 8 bits but it's not limited to 8-bits-per-character. The multibyte characters use common bit patterns to indicate if they are multibyte, and how many bytes the character uses. If you can safely assume only the two encodings in your string, the rest should be simple.
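Those bit patterns can be checked with a single regular expression over the raw bytes. This sketch (the helper name is mine) encodes the well-formed UTF-8 byte sequences from RFC 3629:

```perl
use strict;
use warnings;

# A byte string is well-formed UTF-8 iff it matches these byte ranges
# (RFC 3629: overlong forms and UTF-16 surrogates are excluded).
sub looks_like_utf8 {
    my $bytes = shift;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                        # 1 byte:  0xxxxxxx (ASCII)
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes: 110xxxxx 10xxxxxx
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, overlongs excluded
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, general case
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, surrogates excluded
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, overlongs excluded
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, general case
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
        )*\z/x ? 1 : 0;
}
```

Any line failing this check can then be treated as ISO-8859-1, matching the two-encoding assumption above.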
In short, I opted to solve my problem with "file -bi" and "iconv -f ISO-8859-1 -t UTF-8".
I recently ran across a similar problem in trying to normalize the encoding of file names. I had a mixture of ISO-8859-1, UTF-8, and ASCII. As I realized while processing the files, I had added complications caused by the directory name having an encoding different from the file's encoding.
I originally tried to use Perl but it could not properly differentiate between UTF-8 and ISO-8859-1 resulting in garbled UTF-8.
In my case it was a one-time conversion on a reasonable file count, so I opted for a slow method that I knew about and that worked with no errors for me (mostly because only 1-2 non-adjacent chars per line used special ISO-8859-1 codes).
Option #1 converts ISO-8859-1 to UTF-8
Option #2 converts ISO-8859-1 to ASCII