grep binary files and UTF16
Standard grep/pcregrep etc. can conveniently be used with binary files for ASCII or UTF8 data - is there a simple way to make them try UTF16 too (preferably simultaneously, but one after the other will do)?
The data I'm trying to get is all ASCII anyway (references in libraries etc.); it just doesn't get found because sometimes there's a 00 between any two characters, and sometimes there isn't.
I don't see any way to get it done semantically, but matching these 00s should do the trick, except I cannot easily use them on the command line.
The easiest way is to just convert the text file to utf-8 and pipe that to grep:
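The command itself did not survive in this copy of the answer. As a minimal sketch of the idea (not necessarily what the author typed), assuming the file is UTF-16 little-endian and that iconv and grep are on the PATH; the function name grep_utf16 is made up for illustration:

```shell
# grep_utf16 PATTERN FILE
# Hypothetical helper: decode FILE from UTF-16LE to UTF-8, then grep the text.
# Assumption: little-endian input. Use utf-16be for big-endian files.
grep_utf16() {
    iconv -f utf-16le -t utf-8 "$2" | grep -- "$1"
}
```

Usage would look like: grep_utf16 query file.txt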
I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.
It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:
If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.
EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:
How does it work? Well it converts your file to hex (without any extra formatting that hexdump usually applies). It pipes that into grep. Grep is using a query that is constructed by echoing your query (without a newline) into iconv which converts it to utf-16. This is then piped into sed to remove the BOM (the first two bytes of a utf-16 file used to determine endianness). This is then piped into hexdump so that the query and the input are the same.
Unfortunately I think this will end up printing out the ENTIRE file if there is a single match. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.
EDIT2: Got it!!!!
This searches for the hex version of the string Test (in utf-16) in the file test.txt.
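The final command is missing here as well. A sketch along the same lines, under a few assumptions: GNU grep built with -P support, little-endian data, and od instead of hexdump for the byte dump. Converting to utf-16le explicitly means iconv emits no BOM, so the sed step that strips it is not needed in this variant. The helper names are hypothetical:

```shell
# utf16_pattern STRING
# Hypothetical helper: print STRING as a \xHH-escaped UTF-16LE byte pattern
# for grep -P, e.g. "Test" becomes \x54\x00\x65\x00\x73\x00\x74\x00.
utf16_pattern() {
    printf '%s' "$1" | iconv -f utf-8 -t utf-16le | od -An -v -tx1 |
        tr -d ' \n' | sed 's/../\\x&/g'
    echo
}

# grep_utf16_raw STRING FILE
# Search FILE byte-wise for STRING as UTF-16LE; -a treats the binary file as text.
# Assumption: the local grep supports -P (Perl-compatible regexes).
grep_utf16_raw() {
    LC_ALL=C grep -a -P -- "$(utf16_pattern "$1")" "$2"
}
```

Usage would look like: grep_utf16_raw Test test.txt. For big-endian data the iconv target would have to be utf-16be instead.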
I found the below solution worked best for me, from https://www.splitbits.com/2015/11/11/tip-grep-and-unicode/
Grep does not play well with Unicode, but it can be worked around. For example, to find,
in a UTF-16 file, use a regular expression to ignore the first byte in each character,
Also, tell grep to treat the file as text, using '-a'. The final command looks like this:
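The commands from the linked tip are not preserved in this copy. A rough sketch of the approach, assuming a plain alphanumeric search string and GNU grep; the helper names are made up. It puts a "." between the characters so the regex skips the extra byte UTF-16 stores with each ASCII character:

```shell
# dotted_pattern STRING
# Hypothetical helper: insert "." between the characters of STRING,
# e.g. "Hello" becomes H.e.l.l.o. Regex metacharacters are not escaped.
dotted_pattern() {
    printf '%s\n' "$1" | sed 's/./&./g;s/.$//'
}

# grep_dotted STRING FILE
# -a makes grep treat the binary-looking file as text.
grep_dotted() {
    LC_ALL=C grep -a -- "$(dotted_pattern "$1")" "$2"
}
```

Since "." matches any byte, this is looser than an exact search: it would also match, say, Hxexlxlxo.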
You can explicitly include the nulls (00s) in the search string, though you will get results with nulls, so you may want to redirect the output to a file so you can look at it with a reasonable editor, or pipe it through sed to replace the nulls. To search for "bar" in *.utf16.txt:
The "-P" tells grep to accept Perl regexp syntax, which allows \x00 to expand to null, and the -a tells it to ignore the fact that Unicode looks like binary to it.
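The command is missing from this copy. A sketch of what the description amounts to, assuming GNU grep with -P and little-endian data; the helper names are hypothetical, and tr is used here in place of sed to drop the nulls from the output:

```shell
# nul_pattern STRING
# Hypothetical helper: append \x00 to each character, e.g. "bar" becomes
# b\x00a\x00r\x00 (the UTF-16LE byte layout of an ASCII string).
# Regex metacharacters in STRING are not escaped.
nul_pattern() {
    printf '%s\n' "$1" | sed 's/./&\\x00/g'
}

# grep_nul STRING FILE...
# Search with grep -P -a, then delete the NUL bytes so the matches are readable.
grep_nul() {
    pat=$(nul_pattern "$1")
    shift
    LC_ALL=C grep -a -P -- "$pat" "$@" | tr -d '\000'
}
```

Usage would look like: grep_nul bar *.utf16.txt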
ripgrep
Use the ripgrep utility to grep UTF-16 files. Example syntax:
To dump all lines, run: rg -N . file
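The example syntax was lost in this copy. As a hedged sketch: ripgrep's documentation describes automatic UTF-16 detection when the file starts with a BOM, and the -E/--encoding option covers files without one. The wrapper name rg_utf16 is made up, and little-endian input is an assumption:

```shell
# rg_utf16 PATTERN FILE...
# Hypothetical wrapper: -E forces the input encoding (for files without a BOM),
# -N suppresses line numbers as in the dump command above.
rg_utf16() {
    rg -N -E utf-16le -- "$@"
}
```

Usage would look like: rg_utf16 pattern file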
I use this one all the time after dumping the Windows registry as its output is unicode. This is running under Cygwin.
ugrep fully supports Unicode, UTF-8/16/32 input files, detects invalid Unicode to ensure proper results, displays text and binary files, and is fast and free:
Disclosure: I'm the original author of the free open source ugrep tool that many others have contributed to since.
I needed to do this recursively, and here's what I came up with:
This is absolutely horrible and very slow; I'm certain there's a better way and I hope someone can improve on it -- but I was in a hurry :P
What the pieces do:
The first piece gives a recursive list of filenames with paths relative to the current directory.
The Bash loop: for each line of the list of file paths, put the path into $l and do the thing in the loop. (Why I used a shell loop instead of xargs, which would've been much faster: I need to prefix each line of the output with the name of the current file. I couldn't think of a way to do that if I was feeding multiple files at once to iconv, and since I'm going to be doing one file at a time anyway, a shell loop has easier syntax/escaping.)
Convert the file named in $l: assume the input file is utf-16 little-endian and convert it to utf-8. The -s makes iconv shut up about any conversion errors (there will be a lot, because some files in this directory structure are not utf-16). The output from this conversion goes to stdout.
This is a hack: nl inserts line numbers, but it happens to have a "use this arbitrary string to separate the number from the line" parameter, so I put the filename (followed by colon and space) in that. Then I use cut to strip off the line number, leaving just the filename prefix. (Why I didn't use sed: escaping is much easier this way. If I used a sed expression, I'd have to worry about regular expression characters in the filenames, which in my case there were a lot of. nl is much dumber than sed, and will just take the -s parameter entirely literally, and the shell handles the escaping for me.)
So, by the end of this pipeline, I've converted a bunch of files into lines of utf-8, prefixed with the filename, which I then grep. If there are matches, I can tell which file they're in from the prefix.
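The command line itself is missing from this copy, so here is a sketch reconstructed from the description above rather than the author's exact text. The function name and its arguments are hypothetical, and nl's default 6-character number column is assumed:

```shell
# grep_utf16_tree PATTERN [DIR]
# Sketch of the described pipeline: decode every file under DIR as UTF-16LE,
# prefix each output line with "path: ", and grep the result.
grep_utf16_tree() {
    find "${2:-.}" -type f | while IFS= read -r l; do
        # -s silences conversion errors from files that are not UTF-16.
        # nl -ba numbers every line using "$l: " as the separator; cut then
        # drops the 6-character number column, leaving the filename prefix.
        iconv -s -f utf-16le -t utf-8 "$l" | nl -ba -s "$l: " | cut -c7- | grep -- "$1"
    done
}
```

Usage would look like: grep_utf16_tree somestring . Note that grep sees the prefix too, so a pattern that occurs in a file path matches every line of that file.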
Caveats
This is much slower than grep -R, because I'm spawning a new copy of iconv, nl, cut, and grep for every single file. It's horrible.
Files that are not utf-16 little-endian won't be searched properly, so you need to run a normal grep -R as well as this command (and if you have multiple unicode encoding types, like some big-endian and some little-endian files, you need to adjust this command and run it again for each different encoding).
I added this as a comment to the accepted answer above, but I'm posting it here to make it easier to read. This allows you to search for text in a bunch of files while also displaying the names of the files in which the text is found. All of these files have a .reg extension since I'm searching through exported Windows Registry files. Just replace .reg with any file extension.
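The command is not preserved in this copy. One way to get the described effect, offered as a sketch and not as the author's command: it assumes GNU grep, whose -H and --label options print a chosen name for input read from stdin, and UTF-16LE files. The function name grep_reg is hypothetical:

```shell
# grep_reg PATTERN
# Hypothetical sketch: decode each *.reg file in the current directory from
# UTF-16LE and have grep print the file name in front of every match.
grep_reg() {
    for f in ./*.reg; do
        [ -e "$f" ] || continue   # skip the literal glob when nothing matches
        iconv -f utf-16le -t utf-8 "$f" | grep -H --label="$f" -- "$1"
    done
}
```

Usage would look like: grep_reg sometext, which prints lines in the form ./name.reg:matching line.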
You can use the following Ruby one-liner:
For simplicity, this can be defined as a shell function like:
Then it can be used in a similar way to grep:
Source: How to use Ruby's readlines.grep for UTF-16 files?
The sed statement is more than I can wrap my head around. I have a simplistic, far-from-perfect TCL script that I think does an OK job with my test point of one: