How can I use Perl to convert a bunch of files from ISO-8859-1 to UTF-8?
I have several documents I need to convert from ISO-8859-1 to UTF-8 (without the BOM of course). This is the issue though. I have so many of these documents (it is actually a mix of documents, some UTF-8 and some ISO-8859-1) that I need an automated way of converting them. Unfortunately I only have ActivePerl installed and don't know much about encoding in that language. I may be able to install PHP, but I am not sure as this is not my personal computer.
Just so you know, I use Scite or Notepad++, but both do not convert correctly. For example, if I open a document in Czech that contains the character "ž" and go to the "Convert to UTF-8" option in Notepad++, it incorrectly converts it to an unreadable character.
There is a way I CAN convert them, but it is tedious. If I open the document with the special characters and copy the document to Windows clipboard, then paste it into a UTF-8 document and save it, it is okay. This is too tedious (opening every file and copying/pasting into a new document) for the amount of documents I have.
Any ideas?
Thanks!!!
Answers (3)
If the character 'ž' is included then the encoding is definitely not ISO-8859-1 ("Latin 1") but is probably CP1252 ("Win Latin 1"). Dealing with a mix of UTF8, ISO-8859-1 and CP1252 (possibly even in the same file) is exactly what the Encoding::FixLatin Perl module is designed for.
You can install the module from CPAN by running this command:
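The command itself did not survive extraction; assuming the stock cpan client that ships with most Perl distributions is on your PATH, it would be something like:

```shell
# Install Encoding::FixLatin (and its fix_latin script) from CPAN
cpan Encoding::FixLatin
```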
You could then write a short Perl script that uses the Encoding::FixLatin module, but there's an even easier way. The module comes with a command called fix_latin, which takes mixed encoding on standard input and writes UTF-8 on standard output, so a single command line converts one file. If you're running Windows then the fix_latin command might not be in your path and might not have been run through pl2bat, in which case you'd need to invoke the script through perl with an explicit path.
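As a sketch of both invocations (the input/output filenames and the Windows install path are assumptions, not from the original post):

```shell
# fix_latin reads mixed-encoding text on stdin and writes UTF-8 to stdout
fix_latin < input.txt > output.txt

# Windows, when the script was never run through pl2bat: call it via perl,
# adjusting the path to wherever your Perl distribution put fix_latin
perl C:\Perl\bin\fix_latin < input.txt > output.txt
```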
The exact paths and filenames would need to be adjusted for your system.
To run fix_latin across a whole bunch of files would be trivial on a Linux system, but on Windows you'd probably need to use PowerShell or similar.
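On Linux, the batch run could be a simple loop like the following sketch (the *.txt pattern and the utf8 output directory are my assumptions):

```shell
# Convert every .txt file in the current directory, writing the
# UTF-8 versions into a separate directory so originals are kept
mkdir -p utf8
for f in *.txt; do
    fix_latin < "$f" > "utf8/$f"
done
```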
I'm not sure if this is a valid answer to your particular question, but have you looked at the GNU iconv tool? It's fairly generally available.
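For a single file whose encoding is already known, an iconv invocation is enough. A minimal sketch (the filenames are illustrative, and the sample line is created just for the demo):

```shell
# Create a one-line ISO-8859-1 sample ("café", where é is byte 0xE9),
# then re-encode it as UTF-8 with iconv
printf 'caf\xe9\n' > input.txt
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
```

iconv writes the converted text to stdout rather than converting in place, which is why the result is redirected to a second file.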
If you have access to Cygwin or are able to download a couple of common *nix tools (you'll need bash, grep, iconv and file, all of which are available for Windows via, say, gnuwin32), you might be able to write a rather simple shell script that does the job.
The script would look approximately as follows:
You'll need to test the steps though; e.g. I'm not sure exactly what "file" would say for an ISO-8859 document.
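The script described above can be sketched roughly as follows (a guess at the approach, since the original listing was lost; the function name, the *.txt pattern, and the in-place overwrite are my assumptions):

```shell
# Sketch: use `file` to spot ISO-8859 documents and `iconv` to re-encode
# them to UTF-8 in place; files already in UTF-8/ASCII are left untouched.
convert_if_latin1() {
    f=$1
    if file "$f" | grep -q 'ISO-8859'; then
        iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
}

for f in *.txt; do
    [ -e "$f" ] && convert_if_latin1 "$f"
done
```

As the answer warns, the `file` heuristic should be tested first: bytes in the 0x80–0x9F range (CP1252-only characters such as 'ž') may make `file` report something other than "ISO-8859".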