Grepping binary files and UTF-16

Posted 2024-09-24 08:03:52

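Standard grep/pcregrep etc. can conveniently be used on binary files containing ASCII or UTF-8 data. Is there a simple way to make them try UTF-16 too (preferably at the same time, but separately will do)?

The data I'm trying to get is all ASCII anyway (references in libraries etc.); it just doesn't get found because sometimes there is a 00 between any two characters and sometimes there isn't.

I don't see any way to do it semantically, but matching these 00s should do the trick, except that I cannot easily use them on the command line.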

吾家有女初长成 2024-10-01 08:03:52

The easiest way is to just convert the text file to utf-8 and pipe that to grep:

iconv -f utf-16 -t utf-8 file.txt | grep query

I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.

At first it looked as though grep was converting a UTF-16 query to UTF-8/ASCII. Here is what I tried:

grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt

If test.txt is a UTF-16 file this won't work, but it does work if test.txt is ASCII. The likely explanation is that the NUL bytes never reach grep: a command-line argument cannot contain NUL bytes, and the shell drops them from the command substitution, so grep ends up with a plain ASCII query.

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`

How does it work? Well it converts your file to hex (without any extra formatting that hexdump usually applies). It pipes that into grep. Grep is using a query that is constructed by echoing your query (without a newline) into iconv which converts it to utf-16. This is then piped into sed to remove the BOM (the first two bytes of a utf-16 file used to determine endianness). This is then piped into hexdump so that the query and the input are the same.

Unfortunately, this prints the ENTIRE file if there is a single match, because the hex dump is one long line with no newlines. It also won't work if the UTF-16 in your binary file is stored in a different endianness than your machine's.

EDIT2: Got it!!!!

grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt

This searches the file test.txt for the hex version of the string Test (in UTF-16). If grep only reports that the binary file matches, add -a to make it print the matching lines.
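
A variation on the same idea, offered as a minimal sketch rather than as the method above: asking iconv for utf-16le (or utf-16be) fixes the byte order explicitly and emits no BOM, so the `sed 's/..//'` step is not needed. The function name `utf16le_hex_pattern` is made up for illustration, and `od` stands in for hexdump.

```shell
# Hypothetical helper: print a search term as \xHH escapes of its UTF-16LE bytes.
utf16le_hex_pattern() {
    printf '%s' "$1" |
        iconv -f utf-8 -t utf-16le |   # explicit byte order, no BOM emitted
        od -An -v -tx1 |               # plain hex bytes, no offsets
        tr -d ' \n' |                  # join into one run of hex digits
        sed 's/../\\x&/g'              # prefix every byte with \x
}
```

Assuming a grep built with -P support, it could be used as `grep -aP "$(utf16le_hex_pattern Test)" test.txt`; replace utf-16le with utf-16be inside the function for big-endian data.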

心碎的声音 2024-10-01 08:03:52

I found the below solution worked best for me, from https://www.splitbits.com/2015/11/11/tip-grep-and-unicode/

Grep does not play well with Unicode, but it can be worked around. For example, to find,

Some Search Term

in a UTF-16 file, use a regular expression in which a "." absorbs the extra (zero) byte of each character,

S.o.m.e. .S.e.a.r.c.h. .T.e.r.m

Also, tell grep to treat the file as text, using '-a', the final command looks like this,

grep -a 'S.o.m.e. .S.e.a.r.c.h. .T.e.r.m' utf-16-file.txt
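
Typing the dots by hand gets tedious for longer terms. As a rough sketch (the helper name `dotted_pattern` is hypothetical, and it assumes the search term contains no regular expression metacharacters), the pattern can be generated with sed:

```shell
# Hypothetical helper: turn "Some" into "S.o.m.e" so that each "." can
# absorb the extra byte between two ASCII characters in UTF-16 data.
dotted_pattern() {
    # Append a "." to every character, then drop the trailing one.
    printf '%s' "$1" | sed 's/./&./g; s/\.$//'
}
```

It would then be used as `grep -a "$(dotted_pattern 'Some Search Term')" utf-16-file.txt`.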

挽梦忆笙歌 2024-10-01 08:03:52

You can explicitly include the nulls (00s) in the search string, though you will get results with nulls, so you may want to redirect the output to a file so you can look at it with a reasonable editor, or pipe it through sed to replace the nulls. To search for "bar" in *.utf16.txt:

grep -Pa "b\x00a\x00r" *.utf16.txt | sed 's/\x00//g'

The "-P" tells grep to accept Perl regexp syntax, in which \x00 matches a null byte, and the "-a" tells it to treat the file as text even though UTF-16 looks like binary data to it.
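
Since the question asks for ASCII and UTF-16 to be tried at the same time, one possible extension of this idea is to make each null optional. The sketch below builds such a pattern; the helper name `nul_tolerant_pattern` is hypothetical, and it assumes the term contains no regular expression metacharacters.

```shell
# Hypothetical helper: "bar" becomes b\x00?a\x00?r\x00?, which matches the
# plain ASCII bytes as well as the same letters separated by null bytes.
nul_tolerant_pattern() {
    # Follow every character with an optional null byte (PCRE syntax).
    printf '%s' "$1" | sed 's/./&\\x00?/g'
}
```

Assuming grep supports -P, `grep -aP "$(nul_tolerant_pattern bar)" file | sed 's/\x00//g'` should then find "bar" whether it is stored as ASCII or as UTF-16 in either byte order, because the optional null may fall before or after each letter.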

情话已封尘 2024-10-01 08:03:52

`ripgrep`

Use the ripgrep utility to grep UTF-16 files.

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)

Example syntax:

rg sometext file

To dump all lines, run: rg -N . file

兲鉂ぱ嘚淚 2024-10-01 08:03:52

I use this one all the time after dumping the Windows registry, since its output is UTF-16. This is running under Cygwin.

$ regedit /e registry.data.out
$ file registry.data.out
registry.data.out: Little-endian UTF-16 Unicode text, with CRLF line terminators

$ sed 's/\x00//g' registry.data.out | egrep "192\.168"
"Port"="192.168.1.5"
"IPSubnetAddress"="192.168.189.0"
"IPSubnetAddress"="192.168.102.0"
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
"HostName"="192.168.1.5"
"Port"="192.168.1.5"
"LocationInformation"="http://192.168.1.28:1215/"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"StandaloneDhcpAddress"="192.168.173.1"
"ScopeAddressBackup"="192.168.137.1"
"ScopeAddress"="192.168.137.1"
"DhcpIPAddress"="192.168.1.24"
"DhcpServer"="192.168.1.1"
"0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
"HostName"="192.168.1.5"
"Port"="192.168.1.5"
"LocationInformation"="http://192.168.1.28:1215/"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"StandaloneDhcpAddress"="192.168.173.1"
"ScopeAddressBackup"="192.168.137.1"
"ScopeAddress"="192.168.137.1"
"DhcpIPAddress"="192.168.1.24"
"DhcpServer"="192.168.1.1"
"0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
"MRU0"="192.168.16.93"
[HKEY_USERS\S-1-5-21-2054485685-3446499333-1556621121-1001\Software\Microsoft\Terminal Server Client\Servers\192.168.16.93]
"A"="192.168.1.23"
"B"="192.168.1.28"
"C"="192.168.1.200:5800"
"192.168.254.190::5901/extra"=hex:02,00
"00"="192.168.254.190:5901"
"ImagePrinterPort"="192.168.1.5"
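
The \x00 escape in the sed expression is, as far as I know, a GNU extension. Where it is not available, tr can strip the nulls instead. This is only a sketch: the name `strip_nuls` is made up, it is only suitable when the text is in the ASCII range, and the two BOM bytes at the start of the file are left in place.

```shell
# Hypothetical helper: delete every null byte from standard input.
strip_nuls() {
    tr -d '\000'
}
```

Usage would mirror the command above: `strip_nuls < registry.data.out | grep -E "192\.168"`.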

趴在窗边数星星i 2024-10-01 08:03:52

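ugrep fully supports Unicode and UTF-8/16/32 input files, checks for invalid Unicode to ensure correct results, displays text and binary files, and is fast and free:

ugrep searches UTF-8/16/32 input and other formats. The --encoding option allows many other file formats to be searched, such as ISO-8859-1 to 16, EBCDIC, code pages 437, 850, 858, 1250 to 1258, MacRoman, and KOI8.

Disclosure: I am the original author of the free and open-source ugrep tool, to which many others have since contributed.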

养猫人 2024-10-01 08:03:52

I needed to do this recursively, and here's what I came up with:

find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done

This is absolutely horrible and very slow; I'm certain there's a better way and I hope someone can improve on it -- but I was in a hurry :P

What the pieces do:

find -type f

gives a recursive list of filenames, with paths relative to the current directory

while read l; do ... done

Bash loop; for each line of the list of file paths, put the path into $l and do the thing in the loop. (Why I used a shell loop instead of xargs, which would've been much faster: I need to prefix each line of the output with the name of the current file. Couldn't think of a way to do that if I was feeding multiple files at once to iconv, and since I'm going to be doing one file at a time anyway, shell loop is easier syntax/escaping.)

iconv -s -f utf-16le -t utf-8 "$l"

Convert the file named in $l: assume the input file is utf-16 little-endian and convert it to utf-8. The -s makes iconv shut up about any conversion errors (there will be a lot, because some files in this directory structure are not utf-16). The output from this conversion goes to stdout.

nl -s "$l: " | cut -c7-

This is a hack: nl inserts line numbers, but it happens to have a "use this arbitrary string to separate the number from the line" parameter, so I put the filename (followed by a colon and a space) in that. Then I use cut to strip off the line number, leaving just the filename prefix. (Why I didn't use sed: escaping is much easier this way. With a sed expression I would have to worry about regular expression characters in the filenames, and in my case there were a lot of them. nl is much dumber than sed and takes the -s parameter entirely literally, and the shell handles the escaping for me.)

So, by the end of this pipeline, I've converted a bunch of files into lines of utf-8, prefixed with the filename, which I then grep. If there are matches, I can tell which file they're in from the prefix.

Caveats

This is much, much slower than grep -R, because I'm spawning a new copy of iconv, nl, cut, and grep for every single file. It's horrible.
Everything that isn't utf-16le input will come out as complete garbage, so if there's a normal ASCII file that contains 'somestring', this command won't report it -- you need to do a normal grep -R as well as this command (and if you have multiple unicode encoding types, like some big-endian and some little-endian files, you need to adjust this command and run it again for each different encoding).
Files whose name happens to contain 'somestring' will show up in the output, even if their contents have no matches.
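
One possible improvement, sketched here under the assumption that GNU grep is available: its --label option names standard input, so together with -H it can produce the filename prefix without nl and cut, and the filename itself is no longer searched. The function name `grep_utf16_tree` is hypothetical. It still converts one file at a time and still assumes utf-16le input, so the first two caveats remain.

```shell
# Hypothetical helper: grep_utf16_tree PATTERN [DIR]
# Converts each file under DIR from UTF-16LE to UTF-8 and greps the result,
# prefixing every match with the path of the file it came from.
grep_utf16_tree() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -exec sh -c '
        pattern=$1; shift
        for f in "$@"; do
            # Conversion errors from non-UTF-16 files are discarded.
            iconv -f utf-16le -t utf-8 "$f" 2>/dev/null |
                grep -H --label="$f" -e "$pattern"
        done
    ' sh "$pattern" {} +
}
```

For example, `grep_utf16_tree somestring .` searches the current directory tree.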

沒落の蓅哖 2024-10-01 08:03:52

I added this as a comment to the accepted answer above, but I'm posting it here to make it easier to read. This allows you to search for text in a bunch of files while also displaying the names of the files it is searching. All of these files have a .reg extension since I'm searching through exported Windows Registry files. Just replace .reg with any file extension.

# Define grepreg in bash by pasting at the bash command prompt
grepreg ()
{
    find -name '*.reg' -exec echo {} \; -exec iconv -f utf-16 -t utf-8 {} \; | grep "$1\|\.reg"
}

# Sample usage
grepreg SampleTextToSearch

时光无声 2024-10-01 08:03:52

You can use the following Ruby one-liner:

ruby -e "puts File.open('file.txt', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new 'PATTERN'.encode(Encoding::UTF_16LE))"

For simplicity, this can be defined as a shell function:

grep-utf16() { ruby -e "puts File.open('$2', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new '$1'.encode(Encoding::UTF_16LE))"; }

It can then be used much like grep:

grep-utf16 PATTERN file.txt

Source: How to use Ruby's readlines.grep for UTF-16 files?

兔姬 2024-10-01 08:03:52

The sed statement is more than I can wrap my head around. I have a simplistic, far-from-perfect Tcl script that I think does an OK job, judging by my single test case:

#!/usr/bin/tclsh

set insearch [lindex $argv 0]

set search ""

for {set i 0} {$i<[string length $insearch]-1} {incr i} {
    set search "${search}[string range $insearch $i $i]."
}
set search "${search}[string range $insearch $i $i]"

for {set i 1} {$i<$argc} {incr i} {
    set file [lindex $argv $i]
    set status 0
    if {! [catch {exec grep -a $search $file} results options]} {
        puts "$file: $results"
    }
}
