How can I use Perl to convert a bunch of files from ISO-8859-1 to UTF-8?

Posted on 2024-08-29 08:56:38

I have several documents I need to convert from ISO-8859-1 to UTF-8 (without the BOM of course). This is the issue though. I have so many of these documents (it is actually a mix of documents, some UTF-8 and some ISO-8859-1) that I need an automated way of converting them. Unfortunately I only have ActivePerl installed and don't know much about encoding in that language. I may be able to install PHP, but I am not sure as this is not my personal computer.

Just so you know, I use SciTE or Notepad++, but neither converts correctly. For example, if I open a document in Czech that contains the character "ž" and go to the "Convert to UTF-8" option in Notepad++, it incorrectly converts it to an unreadable character.

There is a way I CAN convert them, but it is tedious. If I open the document with the special characters and copy the document to the Windows clipboard, then paste it into a UTF-8 document and save it, it is okay. This is too tedious (opening every file and copying/pasting into a new document) for the number of documents I have.

Any ideas?
Thanks!!!

Comments (3)

迷途知返 2024-09-05 08:56:38

If the file contains the character 'ž', then the encoding is definitely not ISO-8859-1 ("Latin 1"), which has no such character; it is probably CP1252 ("Win Latin 1"), which does. Dealing with a mix of UTF-8, ISO-8859-1 and CP1252 (possibly even in the same file) is exactly what the Encoding::FixLatin Perl module is designed for.
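
A quick way to check this claim, assuming a POSIX shell with iconv and od available (for example under Cygwin). The helper name to_utf8_hex is made up for illustration. It decodes byte 0x9E (octal 236) under each encoding and prints the resulting UTF-8 bytes in hex:

```shell
# Hypothetical helper: decode the bytes given as a printf escape string ($1)
# from encoding $2, and print the resulting UTF-8 bytes as hex.
to_utf8_hex() {
    printf "$1" | iconv -f "$2" -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

# to_utf8_hex '\236' CP1252      prints c5be, the UTF-8 encoding of "ž" (U+017E)
# to_utf8_hex '\236' ISO-8859-1  prints c29e, the C1 control character U+009E
```

So the same byte is a readable letter under CP1252 but an invisible control character under ISO-8859-1, which matches the "unreadable character" symptom described in the question.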

You can install the module from CPAN by running this command:

perl -MCPAN -e "install 'Encoding::FixLatin'"

You could then write a short Perl script that uses the Encoding::FixLatin module, but there's an even easier way. The module comes with a command called fix_latin, which takes mixed encoding on standard input and writes UTF-8 on standard output. So you could use a command line like this to convert one file:

fix_latin <input-file.txt >output-file.txt

If you're running Windows, the fix_latin command might not be in your path and might not have been run through pl2bat, in which case you'd need to do something like:

perl C:\perl\bin\fix_latin.pl <input-file.txt >output-file.txt

The exact paths and filenames would need to be adjusted for your system.

Running fix_latin across a whole bunch of files would be trivial on a Linux system, but on Windows you'd probably need to use PowerShell or something similar.
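
On a Unix-like shell (Linux, or Cygwin on Windows), the batch run could be sketched as below. This is only a sketch: the function name convert_dir, the *.txt pattern and the directory names are assumptions for illustration, and the converter command is passed in as arguments so that fix_latin can be swapped for something else. Output goes to a separate directory so the originals stay untouched.

```shell
# Hypothetical helper: run a stdin-to-stdout converter over every .txt file
# in a source directory, writing results under the same names elsewhere.
# Usage: convert_dir SRC_DIR DST_DIR COMMAND [ARGS...]
convert_dir() {
    src=$1
    dst=$2
    shift 2
    mkdir -p "$dst" || return 1
    for f in "$src"/*.txt; do
        [ -f "$f" ] || continue          # pattern matched nothing
        "$@" < "$f" > "$dst/$(basename "$f")" || echo "Failed: $f" >&2
    done
}

# Example (assumes fix_latin is on the PATH):
#   convert_dir originals converted fix_latin
```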

一张白纸 2024-09-05 08:56:38

I'm not sure if this is a valid answer to your particular question, but have you looked at the GNU iconv tool? It's fairly widely available.

话少情深 2024-09-05 08:56:38

If you have access to Cygwin or are able to download a couple of common *nix tools (you'll need bash, grep, iconv and file, all of which are available for Windows via, say, gnuwin32), you might be able to write a rather simple shell script that does the job.

The script would look roughly like this:

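for f in *; do
   # Skip directories and output files left over from an earlier run.
   [ -f "$f" ] || continue
   case "$f" in *.converted) continue ;; esac
   # Convert only files that file(1) identifies as ISO-8859.
   if file -b "$f" | grep -q 'ISO-8859'; then
      iconv -f iso-8859-1 -t utf-8 < "$f" > "$f.converted"
   else
      echo "Not converting $f"
   fi
done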

You'll need to test the steps, though; for example, I'm not sure exactly what "file" reports for an ISO-8859 document.
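
One way to sidestep the uncertainty about "file", sketched under the assumption that iconv is available: asking iconv to convert a file from UTF-8 to UTF-8 fails with a non-zero exit status when the input is not valid UTF-8, so it can serve as the test for which files still need converting. The function name is_utf8 is made up for illustration. Note that pure ASCII files count as valid UTF-8 and would be left alone.

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# Example use inside the loop:
#   if is_utf8 "$f"; then echo "Already UTF-8: $f"; else ...convert...; fi
```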
