findstr or grep that autodetects character encoding (UTF-16)

Posted 2024-07-10 21:24:05

I want to do this:

 findstr /s /c:some-symbol *

or the grep equivalent

 grep -R some-symbol *

but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even have the byte-order mark (FF FE) in them, so I'm not even looking for heroic autodetection.

Any suggestions?


I'm referring to Windows Vista and XP.

Comments (7)

西瑶 2024-07-17 21:24:05

A workaround is to convert your UTF-16 to ASCII or ANSI

TYPE UTF-16.txt > ASCII.txt

Then you can use FINDSTR.

FINDSTR object ASCII.txt
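
Where TYPE is not available (for example outside cmd.exe), the same conversion step can be sketched in Python. This is only an illustration, assuming the input starts with a byte-order mark; the function name `utf16_to_utf8` is hypothetical, and it writes UTF-8 rather than the ANSI code page that TYPE would produce.

```python
from pathlib import Path


def utf16_to_utf8(src, dst):
    """Re-encode a BOM-prefixed UTF-16 file as UTF-8 (hypothetical helper)."""
    # Python's "utf-16" codec reads the BOM to choose the byte order
    # and drops it from the decoded text.
    text = Path(src).read_bytes().decode("utf-16")
    # Work on bytes so that line endings are preserved exactly.
    Path(dst).write_bytes(text.encode("utf-8"))
```

The converted file can then be searched with FINDSTR or grep as usual, provided the search string itself is ASCII.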

不知所踪 2024-07-17 21:24:05

Thanks for the suggestions. I was referring to Windows Vista and XP.

I also discovered this workaround, using free Sysinternals strings.exe:

C:\> strings -s -b dir_tree_to_search | grep regexp 

Strings.exe extracts all of the strings it finds (from binaries, but works fine with text files too) and prepends each result with a filename and colon, so take that into account in the regexp (or use cut or another step in the pipeline). The -s makes it do a recursive extraction and -b just suppresses the banner message.

Ultimately, I'm still kind of surprised that the flagship searching utilities, GNU grep and findstr, don't handle Unicode character encodings natively.

蓝海 2024-07-17 21:24:05

On Windows, you can also use find.exe.

find /i /n "YourSearchString" *.*

The only problem is this prints file names followed by matches. You may filter them by piping to findstr

find /i /n "YourSearchString" *.* | findstr /i "YourSearchString"

把昨日还给我 2024-07-17 21:24:05

findstr /s /c:some-symbol *

can be replaced with the following character encoding aware command:

for /r %f in (*) do @find /i /n "some-symbol" "%f"

私野 2024-07-17 21:24:05

According to this blog article by Damon Cortesi, grep doesn't work with UTF-16 files, as you found out. However, the article presents this workaround:

for f in `find . -type f | xargs -I {} file {} | grep UTF-16 | cut -f1 -d\:`
        do iconv -f UTF-16 -t UTF-8 $f | grep -iH --label=$f ${GREP_FOR}
done

This is obviously for Unix, not sure what the equivalent on Windows would be. The author of that article also provides a shell-script to do the above that you can find on github here.

This only greps files that are UTF-16. You'd also grep your ASCII files the normal way.
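
For a variant that runs on Windows as well, the same idea (check the BOM, decode, then search) can be sketched in Python. This is a minimal sketch rather than a grep replacement: the helper names `detect_encoding`, `grep_file`, and `grep_tree` are hypothetical, matching is plain substring search instead of regular expressions, and files without a BOM are assumed to be single-byte text.

```python
import codecs
import os

# BOM -> Python codec name. UTF-32 is checked first because its
# little-endian BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(head, default="latin-1"):
    """Pick a codec from the leading bytes; fall back to a single-byte codec."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    # Assumption: BOM-less files are ASCII/ANSI; latin-1 never fails to decode.
    return default


def grep_file(path, needle):
    """Return (line_number, line) pairs for lines containing needle."""
    with open(path, "rb") as fh:
        data = fh.read()
    text = data.decode(detect_encoding(data[:4]), errors="replace")
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if needle in line]


def grep_tree(root, needle):
    """Walk root recursively and yield (path, line_number, line) matches."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            for n, line in grep_file(path, needle):
                yield path, n, line
```

Unlike the shell loop, this handles BOM-marked and plain ASCII files in one pass, for example `for hit in grep_tree(".", "some-symbol"): print(*hit, sep=":")`.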

粉红×色少女 2024-07-17 21:24:05

In later versions of Windows, UTF-16 is supported out of the box. If not, try changing the active code page with the chcp command.

In my case, findstr alone failed on UTF-16 files, but it worked when fed by type:

type *.* | findstr /s /c:some-symbol

香橙ぽ 2024-07-17 21:24:05

You didn't say which platform you want to do this on.

On Windows, you could use PowerGREP, which automatically detects Unicode files that start with a byte order mark. (There's also an option to auto-detect files without a BOM. The auto-detection is very reliable for UTF-8, but limited for UTF-16.)
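
To illustrate why BOM-less UTF-16 detection is limited, here is a rough Python sketch of one common heuristic: mostly-ASCII text stored as UTF-16 has a NUL byte in every other position. This is not how PowerGREP works internally (that is not documented here); the function name and the thresholds are assumptions chosen for illustration.

```python
def guess_utf16_without_bom(sample, threshold=0.4):
    """Guess 'utf-16-le', 'utf-16-be', or None from a BOM-less byte sample."""
    if len(sample) < 2:
        return None
    even = sample[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = sample[1::2]   # bytes at offsets 1, 3, 5, ...
    even_nul = even.count(b"\x00") / len(even)
    odd_nul = odd.count(b"\x00") / len(odd)
    # ASCII-range characters in little-endian UTF-16 have a zero high byte
    # at odd offsets; in big-endian UTF-16 the zero byte comes first.
    if odd_nul >= threshold and even_nul < 0.1:
        return "utf-16-le"
    if even_nul >= threshold and odd_nul < 0.1:
        return "utf-16-be"
    return None
```

The heuristic returns None for text with few ASCII-range characters (CJK text, for instance, contains almost no NUL bytes in UTF-16), which is one reason a byte-order mark is the more dependable signal.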
