Recursively "normalize" filenames

Posted on 2024-10-12 04:50:15

I mean getting rid of special characters in filenames, etc.

I have made a script that can recursively rename files [http://pastebin.com/raw.php?i=kXeHbDQw]:

For example, before:

THIS i.s my file (1).txt

After running the script:

This-i-s-my-file-1.txt

OK. Here it is:

But when I wanted to test it "fully", with filenames like these:

¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÂÃÄÅÆÇÈÊËÌÎÏÐÑÒÔÕ×ØÙUÛUÝÞßàâãäåæçèêëìîïðñòôõ÷øùûýþÿ.txt
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&'()*+,:;<=>?@[\]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£.txt

It fails [http://pastebin.com/raw.php?i=iu8Pwrnr]:

$ sh renamer.sh directorythathasthefiles
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†....and so on
$

So "mv" can't handle special characters.. :\

I worked on it for many hours..

Does anyone have a working one? [One that can also handle the characters in those two filenames?]


Comments (4)

翻身的咸鱼 2024-10-19 04:50:15

mv handles special characters just fine. Your script doesn't.


In no particular order:

  1. You are using find to find all directories, and ls each directory separately.

    1. Why use for DEPTH in... if you can do exactly the same with one command?

      find -maxdepth 100 -type d
      
    2. Which makes the arbitrary depth limit unnecessary

      find -type d
      
    3. Don't ever parse the output of ls, especially if you can let find handle that, too

      find -not -type d
      
    4. Make sure it works in the worst possible case:

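      find -not -type d -print0 | while IFS= read -r -d '' FILENAME; do
      

      Here -r stops read from eating backslash escapes, IFS= keeps leading and trailing whitespace intact, and -print0 together with -d '' keeps the loop from choking on filenames with newline characters.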

  2. You are repeating the entire ls | replace cycle for every single character. Don't - it kills performance. Loop over all the files in each directory once, and either chain several sed calls or put multiple replacements in one sed command.

    sed 's/á/a/g; s/í/i/g; ...'
    

    (I was going to suggest sed 'y/áí/ai/', but unfortunately that doesn't seem to work with Unicode. Perhaps perl -CS -Mutf8 -pe 'y/áí/ai/' would.)

  3. You're still thinking in ASCII: "other special chars - ASCII Codes 33.. ..255". Don't.

    1. These days, most systems use Unicode in UTF-8 encoding, which has a much wider range of "special" characters - so big that listing them out one by one becomes pointless. (It is even multibyte - "e" is one byte, while "ė" is two bytes, or three if it is stored as "e" plus a combining dot.)

    2. True ASCII has 128 characters. What you currently have in mind are the ISO 8859 character sets (sometimes called "ANSI") - in particular, ISO 8859-1. But they go all the way up to 8859-16, and only the "ASCII" part stays the same.

  4. echo -n $(command) is rather useless.

  5. There are much easier ways to find the directory and basename given a path. For example, you can do

    directory=$(dirname "$path")
    oldname=$(basename "$path")
    # filter $oldname into $newname
    mv "$path" "$directory/$newname"
    
  6. Do not use egrep to check for errors. Check the program's return code. (Like you already do with cd.)

  7. And instead of filtering out other errors, do...

    if [[ -e $directory/$newname ]]; then
        echo "target already exists, skipping: $oldname -> $newname"
        continue
    else
        mv "$path" "$directory/$newname"
    fi
    
  8. The ton of sed 's/------------/-/g' calls can be changed to a single regexp:

    sed -r 's/-{2,}/-/g'
    
  9. The [ ]s in tr [foo] [bar] are unnecessary. They just cause tr to replace [ with [, and ] with ].

  10. Seriously?

    echo "$FOLDERNAME" | sed "s/$/\//g"
    

    How about this instead?

    echo "$FOLDERNAME/"
    

And finally, use detox.
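
Putting several of the points above together (one find with -print0, read -r -d '', dirname/basename, a single sed call, and an existence check before mv), a minimal sketch might look like the following. It assumes bash and a find that supports -print0 and -mindepth; normalize_name and rename_tree are hypothetical names, and the transliteration table is only a tiny sample rather than a complete mapping.

```bash
#!/usr/bin/env bash
# Sketch only: normalize_name and rename_tree are hypothetical helpers,
# and the transliteration table is a tiny sample, not a complete one.

# Print a cleaned-up version of a single file name (no directory part).
# Steps: newlines become "-"; a few accented letters are transliterated;
# every run of bytes outside [A-Za-z0-9.-] becomes one "-"; runs of "-"
# are squeezed; a "-" before a dot or at either end is dropped.
normalize_name() {
    printf '%s' "$1" | tr '\n' '-' | LC_ALL=C sed -E \
        -e 's/á/a/g; s/é/e/g; s/í/i/g; s/ó/o/g; s/ú/u/g' \
        -e 's/[^A-Za-z0-9.-]+/-/g' \
        -e 's/-{2,}/-/g' \
        -e 's/-\././g; s/^-//; s/-$//'
}

# Rename everything below the given directory, deepest entries first.
rename_tree() {
    local root=$1 path dir old new
    # -depth lists a directory's contents before the directory itself,
    # so renaming a directory never invalidates paths still to come.
    find "$root" -depth -mindepth 1 -print0 |
    while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        old=$(basename -- "$path")
        new=$(normalize_name "$old")
        # Nothing to do if the name is already clean or would become empty.
        [[ -z $new || $new == "$old" ]] && continue
        if [[ -e $dir/$new ]]; then
            echo "target already exists, skipping: $path" >&2
            continue
        fi
        mv -- "$path" "$dir/$new"
    done
}
```

It would be called as rename_tree directorythathasthefiles. With these particular rules, "THIS i.s my file (1).txt" becomes "THIS-i.s-my-file-1.txt"; the case change and the handling of inner dots from the original script are deliberately left out.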

小伙你站住 2024-10-19 04:50:15

Try something like:

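find . -type f -print0 | gawk 'BEGIN {RS="\0"} { printf "%s\0", $0; dir=$0; sub(/[^\/]*$/, "", dir); base=substr($0, length(dir)+1); gsub(/[^[:alnum:].]/, "-", base); printf "%s%s\0", dir, base }' | xargs -0 -n 2 mv --

Using xargs(1) with -0 -n 2 ensures that each filename is passed as exactly one parameter and that mv gets one old/new pair per call. awk(1) (GNU awk here, for the NUL record separator) emits the new filename right after the old one, changing only the last path component. In a UTF-8 locale [[:alnum:]] also matches accented letters; run the pipeline under LC_ALL=C if those should be replaced too. mv will complain about names that are already clean, which is harmless.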

One more trick: sed -E 's/-+/-/g' will replace each run of "-" characters with exactly one. (Without -E, sed treats + as a literal character.)
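
The dash-squeezing step can be tried on its own before wiring it into a pipeline. This is only a small sketch; squeeze_dashes is a hypothetical helper name, and the + quantifier relies on extended regular expressions (-E, or -r in GNU sed).

```sh
#!/bin/sh
# Sketch only: squeeze_dashes is a hypothetical helper name.

# Collapse every run of "-" in the argument into a single "-".
squeeze_dashes() {
    printf '%s\n' "$1" | sed -E 's/-+/-/g'
}

squeeze_dashes 'a---b--c-d'
```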

夕色琉璃 2024-10-19 04:50:15

Assuming the rest of your script is right, your problem is that you are using read but you should use read -r. Notice how the backslash disappeared:

áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&'()*+,:;<=>?@[\]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£.txt
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?@[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£
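
The effect is easy to reproduce in isolation. The sketch below feeds the same short name through read with and without -r; read_plain and read_raw are hypothetical helper names, and the sample name is a shortened stand-in for the filenames above.

```sh
#!/bin/sh
# Sketch only: read_plain and read_raw are hypothetical helper names.

# Plain read treats "\" as an escape character and drops it.
read_plain() {
    printf '%s\n' "$1" | { read line; printf '%s\n' "$line"; }
}

# read -r keeps backslashes; IFS= also keeps surrounding whitespace.
read_raw() {
    printf '%s\n' "$1" | { IFS= read -r line; printf '%s\n' "$line"; }
}

read_plain 'a[\]b.txt'   # prints a[]b.txt
read_raw   'a[\]b.txt'   # prints a[\]b.txt
```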

星光不落少年眉 2024-10-19 04:50:15

Ugh...

Some tips to clean up your script:

** Use sed to translate multiple characters at once. That'll clean things up and make the script easier to manage:

dev:~$ echo 'áàaieeé!.txt' | sed -e 's/[áàã]/a/g; s/[éè]/e/g'
aaaieee!.txt

** Rather than renaming the file for each change, run all your filters, then do a single move:

$ NEWNAME='áàaieeé!.txt'
$ NEWNAME="$(echo "$NEWNAME" | sed -e 's/[áàã]/a/g; s/[éè]/e/g')"
$ NEWNAME="$(echo "$NEWNAME" | sed -e 's/aa*/a/g')"
$ echo "$NEWNAME"
aieee!.txt

** Rather than doing an ls | read ... loop, use:

for OLDNAME in "$DIR"/*; do
  blah
  blah
  blah
done

** separate out your path traversal and renaming logic into two scripts. One script finds the files which need to be renamed, one script handles the normalization of a single file. Once you learn the 'find' command, you'll realize you can toss the first script :)
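
As a rough illustration of that split, the sketch below keeps the per-file logic in one function and the traversal of a single directory in another. Both names (normalize_one, normalize_dir) are hypothetical, and the filter is a deliberately small stand-in (spaces to "-", then squeezing runs of "-"), not a full normalization.

```sh
#!/bin/sh
# Sketch only: normalize_one and normalize_dir are hypothetical names,
# and the filter is a placeholder for the real set of replacements.

# Per-file logic: rename one path, touching only its last component.
normalize_one() {
    dir=$(dirname -- "$1")
    old=$(basename -- "$1")
    # Run all filters first, then do a single move.
    new=$(printf '%s' "$old" | tr ' ' '-' | sed -E 's/-+/-/g')
    [ "$new" = "$old" ] && return 0
    if [ -e "$dir/$new" ]; then
        echo "target already exists, skipping: $1" >&2
        return 0
    fi
    mv -- "$1" "$dir/$new"
}

# Traversal of one directory: a glob instead of parsing ls output.
normalize_dir() {
    for OLDNAME in "$1"/*; do
        [ -e "$OLDNAME" ] || continue   # an unmatched glob stays literal
        normalize_one "$OLDNAME"
    done
}
```

Because normalize_one only ever sees one path, the traversal half can later be swapped for find without touching the renaming logic.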
