Batch convert latin-1 files to utf-8 using iconv

Posted on 2024-10-09 06:33:32

I have a PHP project on my OS X machine that is encoded in latin1. Now I need to convert the files to UTF-8. I'm not much of a shell coder, so I tried something I found on the internet:

mkdir new  
for a in `ls -R *`; do iconv -f iso-8859-1 -t utf-8 <"$a" >new/"$a" ; done

But that does not create the directory structure, and it gives me a heck of a lot of errors when run. Can anyone come up with a neat solution?

Comments (12)

豆芽 2024-10-16 06:33:33

The answers above are fine, but if this is a "mixed" project, i.e. some files are already UTF-8, converting everything blindly causes trouble. My solution therefore checks each file's encoding first.

#!/bin/bash
# file name: to_utf8

# current encoding (GNU file; on macOS use file -I instead of file -i):
encoding=$(file -i "$1" | sed "s/.*charset=\(.*\)$/\1/")

if [ "${encoding}" = "iso-8859-1" ] || [ "${encoding}" = "iso-8859-2" ]
then
    echo "recoding from ${encoding} to UTF-8 file : $1"
    # recode from the detected encoding rather than always assuming ISO-8859-2
    recode "${encoding}..UTF-8" "$1"
fi

# example:
# find . -name "*.php" -exec to_utf8 {} \;
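If neither `file` nor `recode` is available, the same "check first" idea can be sketched with iconv alone. This is a different check from the script above, not the same method: it asks iconv whether the file already decodes as UTF-8, and it assumes that every file which does not is ISO-8859-1. The function name `to_utf8_if_needed` is hypothetical. A Latin-1 file whose bytes happen to form valid UTF-8 would be skipped, so treat this as a rough sketch.

```shell
# to_utf8_if_needed FILE: hypothetical helper, not the script from the answer.
# Assumption: any file that is not valid UTF-8 is ISO-8859-1.
to_utf8_if_needed() {
    f=$1
    # iconv exits non-zero when the input contains invalid UTF-8 sequences.
    if iconv -f utf-8 -t utf-8 "$f" > /dev/null 2>&1; then
        echo "skipping (already valid UTF-8): $f"
        return 0
    fi
    tmp=$(mktemp) || return 1
    if iconv -f iso-8859-1 -t utf-8 "$f" > "$tmp"; then
        cat "$tmp" > "$f"   # overwrite the contents, keeping the file's permissions
        rm -f "$tmp"
        echo "converted: $f"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Pure ASCII files also pass the UTF-8 check, so they are left untouched.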
海拔太高太耀眼 2024-10-16 06:33:33

On Windows Git Bash, I got these errors with several of the proposed solutions:

  • find: Only one instance of {} is supported with -exec ... +
  • find: In ‘-exec ... {} +’ the ‘{}’ must appear by itself, but you specified ‘source={};...’

But this (a mix of other proposed solutions) worked:

for fileToConvert in $(find . -type f -name \*.js); do iconv -f iso-8859-1 -t utf-8 <"$fileToConvert" >~/temp-iconv.txt ; mv -f ~/temp-iconv.txt "$fileToConvert" ; done
回忆躺在深渊里 2024-10-16 06:33:33

Use mkdir -p "${a%/*}"; before iconv.

Note that you are using a potentially dangerous for construct when there are spaces in filenames, see http://porkmail.org/era/unix/award.html.
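A minimal sketch of how these two points might be combined, assuming a POSIX shell and an iconv that accepts the input file as an argument. The function name `convert_tree` and its two arguments are hypothetical. Filenames are handed to the inner shell as positional parameters by find, so spaces in names are safe, and `mkdir -p "${out%/*}"` creates each target subdirectory before iconv writes into it.

```shell
# convert_tree SRC DEST: hypothetical helper, not from the original answer.
# Mirrors SRC into DEST, converting every regular file from ISO-8859-1 to UTF-8.
# DEST must be an absolute path outside SRC, because find runs inside SRC.
convert_tree() {
    src=$1
    dest=$2
    ( cd "$src" && find . -type f -exec sh -c '
        dest=$1; shift
        for a do
            out="$dest/${a#./}"
            mkdir -p "${out%/*}" || exit 1    # create the target subdirectory first
            iconv -f iso-8859-1 -t utf-8 "$a" > "$out" || exit 1
        done' sh "$dest" {} + )
}
```

For example, `convert_tree . /path/to/destination` would mirror the current directory into /path/to/destination.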

多彩岁月 2024-10-16 06:33:33

Using the answers of Dennis Williamson and Alberto Zaccagni, I came up with the following script, which converts all files of the specified type in all subdirectories. The output is collected in a single folder, given by /path/to/destination:

mkdir -p /path/to/destination
for a in $(find . -name "*.php")
do
        filename=$(basename "$a")
        echo "$filename"
        iconv -f iso-8859-1 -t utf-8 <"$a" >"/path/to/destination/$filename"
done

The basename command returns the filename without its directory path. Because every output lands in one folder, files that share a name in different subdirectories overwrite each other.

Alternative (user interactive):
I also created an interactive script that lets you decide whether to overwrite the old files or keep the converted copies alongside them. Additional thanks go to tbsalling.

for a in $(find . -name "*.tex")
do
        iconv -f iso-8859-1 -t utf-8 <"$a" >"$a".utf8
done
echo "Should the original files be replaced (Y/N)?"
read -r replace
if [ "$replace" = "Y" ]; then
    for a in $(find . -name "*.tex.utf8")
    do
        # strip the .utf8 suffix to get back the original name
        mv "$a" "${a%.utf8}"
    done
    echo "Original files have been replaced."
else
    echo "Original files were kept; converted files were saved with the suffix '.utf8'."
fi

Have fun with this; I would be grateful for any comments to improve it. Thanks!

眸中客 2024-10-16 06:33:33
# Prints the iconv commands for review instead of running them; pipe the output to sh to execute
find . -iname "*.php" | xargs -I {} echo "iconv -f ISO-8859-1 -t UTF-8 \"{}\" > \"{}-utf8.php\""
纵情客 2024-10-16 06:33:32

You shouldn't use ls like that and a for loop is not appropriate either. Also, the destination directory should be outside the source directory.

mkdir /path/to/destination
find . -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o /path/to/destination/"{}" \;

No need for a loop. The -type f option includes files and excludes directories.

Edit:

The OS X version of iconv doesn't have the -o option. Try this:

find . -type f -exec bash -c 'iconv -f iso-8859-1 -t utf-8 "$1" > /path/to/destination/"$1"' _ {} \;
养猫人 2024-10-16 06:33:32

This converts all files with the .php filename extension - in the current directory and its subdirectories - preserving the directory structure:

find . -name "*.php" -exec sh -c "iconv -f ISO-8859-1 -t UTF-8 '{}' > '{}'.utf8"  \; -exec sh -c "mv '{}.utf8' '{}'" \;

Notes:

To get a list of files that will be targeted beforehand, just run the command without the -exec flags (like this: find . -name "*.php"). Making a backup is a good idea.

Using sh like this allows piping and redirecting with -exec, which is necessary because not all versions of iconv support the -o flag.

Adding .utf8 to the filename of the output and then removing it might seem strange but it is necessary. Using the same name for output and input files can cause the following problems:

  • For large files (around 30 KB in my experience), it causes a core dump (or termination by signal 7).

  • Some versions of iconv seem to create the output-file before they read the input file, which means that if the input and output files have the same name, the input file is overwritten with an empty file before it is read.
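The temporary-file approach described above can also be sketched as a single shell invocation. This is a variant, not the answer's exact command: the wrapper name `convert_in_place` is hypothetical, filenames are passed as positional parameters instead of being substituted into the quoted script (so names containing quotes do not break it), and the original is replaced only if iconv succeeds.

```shell
# convert_in_place DIR PATTERN: hypothetical wrapper around the approach above.
convert_in_place() {
    find "$1" -type f -name "$2" -exec sh -c '
        for f do
            # Write to a sibling temporary file, then replace the original
            # only if iconv reported success.
            if iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.utf8"; then
                mv "$f.utf8" "$f"
            else
                rm -f "$f.utf8"
                echo "conversion failed, left unchanged: $f" >&2
            fi
        done' sh {} +
}
```

For example, `convert_in_place . "*.php"`. Running it a second time on the same files would convert them again and garble non-ASCII characters, so a backup is still a good idea.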

妄想挽回 2024-10-16 06:33:32

Some good answers, but I found this a lot easier in my case with a nested directory of hundreds of files to convert:

WARNING: This writes the files in place, so make a backup first.

$ vim $(find . -type f)

# in vim, go into command mode (:)
:set nomore
:bufdo set fileencoding=utf8 | w
耶耶耶 2024-10-16 06:33:32

I needed to convert a complete directory tree recursively from iso-8859-1 to utf-8, including the creation of subdirectories. None of the short solutions above worked for me, because the directory structure was not created in the target. Based on Dennis Williamson's answer I came up with the following solution:

find . -type f -exec bash -c 't="/tmp/dest"; mkdir -p "$t/$(dirname "$1")"; iconv -f iso-8859-1 -t utf-8 "$1" > "$t/$1"' _ {} \;

It creates a clone of the current directory subtree in /tmp/dest (adjust to your needs), including all subdirectories, with all iso-8859-1 files converted to utf-8. Tested on macOS.

Btw: Check your file encodings with:

file -I file.php

to get the encoding information.

Hope this helps.

风吹雨成花 2024-10-16 06:33:32

I created the following script, which (i) checks the encoding of every .tex file, (ii) backs up each file it is about to convert to the directory "converted", and (iii) converts to UTF-8 only the .tex files that are ISO-8859-1 encoded.

for f in *.tex
do
  filename="${f%.*}"
  printf '%s' "$f"
  # file -I prints the MIME type and charset on macOS (use file -i with GNU file)
  if file -I "$f" | grep -wq "iso-8859-1"
  then
    mkdir -p converted
    cp "$f" ./converted
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "${filename}_utf8.tex"
    mv "${filename}_utf8.tex" "$f"
    echo ": CONVERTED TO UTF-8."
  else
    echo ": NOT ISO-8859-1, LEFT UNCHANGED."
  fi
done
下雨或天晴 2024-10-16 06:33:32

If all the files you have to convert are .php, you could use the following; find is recursive by default:

for a in $(find . -name "*.php"); do iconv -f iso-8859-1 -t utf-8 <"$a" >new/"$a" ; done

I believe your errors were due to the fact that ls -R also prints lines that are not valid filenames for iconv, such as the directory header ./my/dir/structure: that it emits before each listing.

债姬 2024-10-16 06:33:32

On unix.stackexchange.com a similar question was asked, and user manatwork suggested recode, which does the trick very nicely.

I've been using it to convert ucs-2 to utf-8 in place:

recode ucs-2..utf-8 *.txt