Using find or grep to find filenames with accented characters that come from a different encoding (Windows to Linux)

Posted 2024-10-06 11:00:10

I tried to tag late onto a question similar to mine (Find Non-UTF8 Filenames on Linux File System) to elicit further replies, with no luck so far, so here goes again...

I have the same problem as the OP in the link above, and convmv is a great tool for fixing one's own filesystem. My question is therefore academic, but I find it unsatisfactory (in fact, hard to believe) that find cannot find non-standard ASCII characters.

Does anyone know what combination of options will find filenames that contain non-standard characters on what seems to be a Unicode filesystem? In my case the characters appear to be 8-bit extended ASCII rather than Unicode: the files come from a Windows machine (ISO-8859-1) and I regularly need to fetch them. I'd love to see how find and/or grep can do the same as convmv.

Sample files:

> ls
Abc�def ÉÈéèáà-rest everest éverest

> ls -b
Abc\251def  ÉÈéèáà-rest  everest  éverest

First file comes from Windows (or simulated with touch $(printf "Abc\xA9def")).

> find . -regex '.*[^a-zA-Z./].*'
./ÉÈéèáà-rest

> ls | egrep '[^a-zA-Z]'
ÉÈéèáà-rest

This misses almost all of them (only the hyphen saved that one file, as coloured grep output shows). Whatever is happening here is not what I would expect: neither find nor grep treats an accented letter as being outside the range given in [^a-zA-Z./].

> find . -regex '.*é.*'
./éverest
./ÉÈéèáà-rest

> ls | egrep 'é'
ÉÈéèáà-rest
éverest

> ls | egrep '[é]'
ÉÈéèáà-rest
éverest

> find . -regex '.*[é].*'
./éverest
./ÉÈéèáà-rest

Bizarrely, both pick up a standard accented letter when it is given explicitly (including inside a bracket expression). Every attempt with \xA9, \0251 or \o251 in find or grep fails (no match).
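One way to match that stray byte, as a minimal sketch rather than a definitive recipe: it assumes the escapes fail because neither tool's default regex syntax expands \xA9 or \251 itself, and because a lone 0xA9 byte is not a valid character in a UTF-8 locale. Here the shell's printf builds the byte and the C locale makes every byte count as a character. The scratch directory and the variable names (dir, byte, count, found) are only for illustration.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Recreate two of the sample files; the first name holds the raw byte 0xA9 (octal 251).
touch "$(printf 'Abc\251def')" everest

# Let printf produce the byte instead of asking grep or find to expand an escape.
byte=$(printf '\251')

# Count the names containing that byte (fixed-string search, C locale).
count=$(ls | LC_ALL=C grep -c -F "$byte")

# The same byte used literally inside a find regex, again in the C locale.
found=$(LC_ALL=C find . -regex ".*${byte}.*" -print)

echo "$count"
printf '%s\n' "$found"
```

On a terminal the matched name may still be displayed with a placeholder character, since the byte has no valid rendering in a UTF-8 terminal.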

> ls | fgrep e
Abc�def
ÉÈéèáà-rest
everest
éverest

Looking for a non-controversial character shows all files with grep, as I would have expected.

> find . -regex '.*e.*'
./éverest
./ÉÈéèáà-rest
./everest

> find . -name '*e*'
./éverest
./ÉÈéèáà-rest
./everest

find, however, is very discriminatory: even when looking for an ordinary character, it seems to drop filenames that contain bytes which are not valid in the filesystem's name encoding.
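The output above suggests that, in a UTF-8 locale, the pattern matcher gives up on a name containing an invalid byte sequence; that is an inference from the listing, and behaviour may differ between find versions. A small sketch of the usual workaround, assuming GNU find and a POSIX shell, is to run the same -name test in the C locale so that names are compared byte by byte. The scratch directory and the variable names dir and names are hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Recreate the four sample files; the first name holds the raw byte 0xA9 (octal 251).
touch "$(printf 'Abc\251def')" 'ÉÈéèáà-rest' everest 'éverest'

# In the C locale the glob is matched byte by byte, so a byte that is
# invalid as UTF-8 no longer stops the name from matching '*e*'.
names=$(LC_ALL=C find . -name '*e*' -print)
printf '%s\n' "$names"
```

With the sample files this should list all four names, including the one that came from Windows.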

As far as I am concerned if the file is in the filesystem, then find should find it, right? But maybe there's a feature I don't know about?

Any insights would be very much appreciated.

⊕婉儿 2024-10-13 11:00:10

Jander answered the same question, which I also posted on Super User.

Jander's answer does the job perfectly. For those interested in getting more out of this, here is one more tip.

With LANG=C, find displays non-ASCII characters as question marks. It makes that substitution only when its output goes to a terminal, so to get back the normal display for that file system, just pipe the output to cat.

LANG=C find . -regex '.*[^a-zA-Z./-].*'
./??verest
./????????????-rest
./Abc?def

LANG=C find . -regex '.*[^a-zA-Z./-].*' | cat
./éverest
./ÉÈéèáà-rest
./Abc�def
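The regex above also reports names that are perfectly valid UTF-8 (éverest, for instance). To list only the names that are not valid UTF-8, which is closer to what convmv looks for, one possible sketch is to pass each name through iconv and keep those it rejects. This assumes an iconv command is installed and that names contain no newlines; the scratch directory and the variable names dir, name and invalid are only for illustration.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Recreate the four sample files; the first name holds the raw byte 0xA9 (octal 251).
touch "$(printf 'Abc\251def')" 'ÉÈéèáà-rest' everest 'éverest'

# Collect only the paths that are NOT well-formed UTF-8.
invalid=$(
  find . -print | while IFS= read -r name; do
    # iconv exits non-zero when its input is not valid UTF-8.
    if ! printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      printf '%s\n' "$name"
    fi
  done
)
printf '%s\n' "$invalid"
```

On the sample files this should leave just the Windows-style name, while the correctly encoded accented names pass the check.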
