How can I convert a bunch of files from ISO-8859-1 to UTF-8 using Perl?

Posted on 2024-08-29 08:56:38

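I have several documents that I need to convert from ISO-8859-1 to UTF-8 (without a BOM, of course). Here is the problem: I have so many of these documents (it is actually a mix, some UTF-8 and some ISO-8859-1) that I need an automated way to convert them. Unfortunately, I only have ActivePerl installed and do not know much about encodings in that language. I might be able to install PHP, but I am not sure, since this is not my personal computer.

For what it's worth, I use SciTE or Notepad++, but neither converts the files correctly. For example, if I open a Czech document that contains the character "ž" and choose the "Convert to UTF-8" option in Notepad++, it incorrectly turns it into an unreadable character.

There is a way to convert them that works, but it is tedious. If I open a document with the special characters, copy its contents to the Windows clipboard, paste them into a UTF-8 document and save it, the result is fine. Opening every file and copying and pasting it into a new document is far too tedious for the number of documents I have.

Any ideas? Thanks!!!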

迷途知返 2024-09-05 08:56:38

If the files contain the character 'ž', then the encoding is definitely not ISO-8859-1 ("Latin 1"), which has no such character, but is probably CP1252 ("Win Latin 1"). Dealing with a mix of UTF-8, ISO-8859-1 and CP1252 (possibly even in the same file) is exactly what the Encoding::FixLatin Perl module is designed for.
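
To see why treating such a file as ISO-8859-1 produces an unreadable character, the following sketch decodes the single byte 0x9E under both encodings and prints the resulting UTF-8 bytes as hex. It assumes an `iconv` that knows the name `CP1252` (GNU iconv does) plus the standard `od` and `tr` tools; the helper name `decode_9e` is made up for illustration.

```shell
# Decode the single byte 0x9E (octal 236) under the given source encoding
# and print the resulting UTF-8 bytes as hex.
decode_9e() {
    printf '\236' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

decode_9e CP1252       # prints c5be: U+017E, the letter ž
decode_9e ISO-8859-1   # prints c29e: U+009E, an invisible C1 control character
```

The same byte is a real letter in CP1252 and a control character in ISO-8859-1, which is consistent with the broken result described in the question.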

You can install the module from CPAN by running this command:

perl -MCPAN -e "install 'Encoding::FixLatin'"

You could then write a short Perl script that uses the Encoding::FixLatin module, but there's an even easier way. The module comes with a command called fix_latin, which reads mixed-encoding text on standard input and writes UTF-8 to standard output. So you could use a command line like this to convert one file:

fix_latin <input-file.txt >output-file.txt

If you're running Windows, the fix_latin command might not be in your path and might not have been run through pl2bat, in which case you'd need to do something like:

perl C:\perl\bin\fix_latin.pl <input-file.txt >output-file.txt

The exact paths and filenames would need to be adjusted for your system.

Running fix_latin across a whole bunch of files would be trivial on a Linux system, but on Windows you'd probably need to use PowerShell or something similar.
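
As a rough illustration of the batch case on Linux (or under Cygwin), here is a minimal POSIX shell sketch. The function name `convert_dir`, the `*.txt` pattern and the separate output directory are assumptions made for this example, not part of Encoding::FixLatin. The converter defaults to `fix_latin`, but any command that reads bytes on standard input and writes UTF-8 to standard output can be passed instead.

```shell
# convert_dir SRC_DIR OUT_DIR [CONVERTER ARGS...]
# Run a stdin-to-stdout converter over every *.txt file in SRC_DIR and write
# the results to OUT_DIR, so the originals are never overwritten.
convert_dir() {
    src=$1
    out=$2
    shift 2
    # No converter given: fall back to fix_latin from Encoding::FixLatin.
    [ $# -gt 0 ] || set -- fix_latin
    mkdir -p "$out" || return 1
    for f in "$src"/*.txt; do
        [ -f "$f" ] || continue    # the glob matched nothing, or not a file
        "$@" < "$f" > "$out/$(basename "$f")" || return 1
    done
}
```

For example, `convert_dir docs docs-utf8` would run fix_latin on each file, while `convert_dir docs docs-utf8 iconv -f CP1252 -t UTF-8` would use iconv instead. Writing to a separate directory keeps the originals available for comparison if a conversion turns out to be wrong.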

一张白纸 2024-09-05 08:56:38

I'm not sure whether this is a valid answer to your specific question, but have you looked at the GNU iconv tool? It is fairly widely available.
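
For reference, a basic iconv call might look like the sketch below. The wrapper name `to_utf8` and the ISO-8859-1 default are assumptions for illustration; only the `-f` (from) and `-t` (to) options of iconv are used. GNU iconv does not write a BOM when the target is UTF-8, which matches the requirement in the question.

```shell
# to_utf8 IN OUT [ENCODING]
# Convert IN from ENCODING (default ISO-8859-1) to UTF-8 and write it to OUT.
to_utf8() {
    iconv -f "${3:-ISO-8859-1}" -t UTF-8 "$1" > "$2"
}
```

Calling `to_utf8 input-file.txt output-file.txt CP1252` would treat the input as CP1252 instead. Note that iconv converts blindly from the encoding it is told, so running it on a file that is already UTF-8 will double-encode the non-ASCII characters.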

话少情深 2024-09-05 08:56:38

If you have access to Cygwin or are able to download a few common *nix tools (you'll need bash, grep, iconv and file, all of which are available for Windows via, say, GnuWin32), you might be able to write a fairly simple shell script that does the job.

The script would look approximately as follows:

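for f in *; do
   # file(1) typically describes such files as "ISO-8859 text"
   if file "$f" | grep -q 'ISO-8859'; then
      # quote "$f" so names containing spaces survive; iconv reads the file directly
      iconv -f iso-8859-1 -t utf-8 "$f" > "$f.converted"
   else
      echo "Not converting $f"
   fi
done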

You'll need to test the steps, though; for example, I'm not sure exactly what "file" would report for an ISO-8859 document.
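
If the output of "file" proves unreliable, one alternative is to ask iconv itself whether a file already decodes as valid UTF-8 and to convert only the files that do not. The sketch below assumes an iconv that exits with a non-zero status on invalid input (GNU iconv does); the names `is_utf8` and `convert_mixed` are hypothetical.

```shell
# is_utf8 FILE: succeed if FILE already decodes cleanly as UTF-8
# (plain ASCII files count as valid UTF-8).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# convert_mixed SRC_ENCODING FILE...
# Write FILE.converted for each file that is not valid UTF-8 and leave
# files that are already UTF-8 untouched.
convert_mixed() {
    enc=$1
    shift
    for f in "$@"; do
        if is_utf8 "$f"; then
            echo "Not converting $f"
        else
            iconv -f "$enc" -t UTF-8 "$f" > "$f.converted" || return 1
        fi
    done
}
```

Something like `convert_mixed CP1252 *.txt` would then handle a directory that mixes UTF-8 and legacy-encoded files. This check works per file, so it does not help with a single file that mixes encodings internally, which is the case Encoding::FixLatin targets.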
