grep, but indexable?
I have over 200 MB of source code files that I have to constantly look up (I am part of a very big team). I notice that grep does not create an index, so lookup requires going through the entire source code database each time.
Is there a command line utility similar to grep which has indexing ability?
7 Answers
The solutions below are rather simple. There are a lot of corner cases that they do not cover.
The good part about the solutions is that they are very easy to implement.
Solution 1: one big file
Fact: Seeking is dead slow; reading one big file is often faster.
Given those facts the idea is to simply make an index containing all the files with all their content - each line prepended with the filename and the line number:
Index a dir:
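A minimal sketch, assuming GNU grep ('/path/to/dir' and, later, the pattern 'foobar' are placeholders). The pattern '^' matches every line, so grep emits every line of every file prefixed with its filename and line number:

    grep -rHn '^' /path/to/dir > /path/to/dir.index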
Use the index:
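Lookups then read the single index file sequentially:

    grep foobar /path/to/dir.index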
Solution 2: one big compressed file
Fact: Hard drives are slow. Seeking is dead slow. Multi-core CPUs are normal.
So it may be faster to read a compressed file and decompress it on the fly than reading the uncompressed file - especially if you have RAM enough to cache the compressed file but not enough for the uncompressed file.
Index a dir:
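The same idea, piped through a parallel compressor (a sketch, using the pbzip2/pbzcat tools mentioned below):

    grep -rHn '^' /path/to/dir | pbzip2 -c > /path/to/dir.index.bz2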
Use the index:
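Decompress on the fly and grep the stream:

    pbzcat /path/to/dir.index.bz2 | grep foobar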
Solution 3: use index for finding potential candidates
Generating the index can be time consuming and you might not want to do that for every single change in the dir.
To speed that up only use the index for identifying filenames that might match and do an actual grep through those (hopefully limited number of) files. This will discover files that no longer match, but it will not discover new files that do match.
The sort -u is needed to avoid grepping the same file multiple times.
Index a dir:
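The index is built exactly as in solution 2 (shown again as a sketch):

    grep -rHn '^' /path/to/dir | pbzip2 -c > /path/to/dir.index.bz2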
Use the index:
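A sketch: grep the index, keep only the filename field, deduplicate it, then grep the real files (this assumes filenames contain no colons or whitespace):

    pbzcat /path/to/dir.index.bz2 | grep foobar |
      cut -d: -f1 | sort -u | xargs grep foobar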
Solution 4: append to the index
Re-creating the full index can be very slow. If most of the dir stays the same, you can simply append to the index with newly changed files. The index will again only be used for locating potential candidates, so if a file no longer matches it will be discovered when grepping through the actual file.
Index a dir:
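A sketch; a stamp file remembers when the index was last built:

    grep -rHn '^' /path/to/dir | pbzip2 -c > .index.bz2
    touch .index.stamp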
Append to the index:
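bzip2 streams can be concatenated and pbzcat reads them all, so files changed since the stamp can simply be compressed and appended (a sketch):

    find /path/to/dir -type f -newer .index.stamp -print0 |
      xargs -0 grep -Hn '^' | pbzip2 -c >> .index.bz2
    touch .index.stamp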
Use the index:
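Same as in solution 3; files that no longer match are weeded out by the final grep over the real files:

    pbzcat .index.bz2 | grep foobar |
      cut -d: -f1 | sort -u | xargs grep foobar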
It can be even faster if you use pzstd instead of pbzip2/pbzcat.
Solution 5: use git
git grep can grep through a git repository. But it seems to do a lot of seeks and is 4 times slower on my system than solution 4. The good part is that the .git index is smaller than the .index.bz2.
Index a dir:
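A sketch; the repository itself serves as the (compressed) index:

    cd /path/to/dir
    git init
    git add .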
Append to the index:
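Re-adding the tree picks up changed files:

    git add .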
Use the index:
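A sketch (add --cached to search the staged copies instead of the working tree):

    git grep foobar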
Solution 6: optimize git
Git puts its data into many small files. This results in seeking. But you can ask git to compress the small files into few, bigger files:
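A sketch using git's garbage collector, which repacks loose objects into pack files:

    git gc --aggressive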
This takes a while, but it packs the index very efficiently into a few files.
Now you can do:
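A sketch: pre-reading the pack files pulls them into the page cache before grepping:

    cat .git/objects/pack/* > /dev/null
    git grep foobar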
git will do a lot of seeking into the index, but by running cat first, you put the whole index into RAM.
Adding to the index is the same as in solution 5, but run git gc now and then to avoid many small files, and git gc --aggressive to save more disk space, when the system is idle.
git will not free disk space if you remove files. So if you remove large amounts of data, remove .git and do git init; git add . again.
There is the https://code.google.com/p/codesearch/ project, which is capable of creating an index and searching it quickly. Regexps are supported and computed using the index (actually, only a subset of regexps can use the index to filter the file set; the real regexp is then re-evaluated on the matched files).
The codesearch index is usually 10-20% of the source code size, building it is about as fast as running classic grep 2 or 3 times, and searching is almost instantaneous.
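A usage sketch (the project provides cindex to build the index and csearch to query it; the index lives in ~/.csearchindex by default):

    cindex /path/to/src      # build or update the index
    csearch -n 'myRegex'     # search; -n prints line numbers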
The ideas used in the codesearch project come from Google's Code Search site (RIP). E.g. the index contains a map from n-grams (trigrams, i.e. every 3-byte sequence found in your sources) to the files; when searching, the regexp is translated into a trigram query.
PS And there are ctags and cscope to navigate in C/C++ sources. Ctags can find declarations/definitions, cscope is more capable, but has problems with C++.
PPS And there are also clang-based tools for C/C++/ObjC languages: http://blog.wuwon.id.au/2011/10/vim-plugin-for-navigating-c-with.html and clang-complete
Without addressing the indexing ability part, git grep will, with Git 2.8 (Q1 2016), have the ability to run in parallel!
See commit 89f09dd, commit 044b1f3, commit b6b468b (15 Dec 2015) by Victor Leschuk (vleschuk).
(Merged by Junio C Hamano -- gitster -- in commit bdd1cc2, 12 Jan 2016)
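For illustration, that series added a --threads option and a grep.threads configuration variable (the thread count here is arbitrary):

    git grep --threads=8 foobar
    git config grep.threads 8    # or set it once per repository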
ack is a code searching tool that is optimized for programmers, especially programmers dealing with large heterogeneous source code trees: http://beyondgrep.com/
Are some of your searches ones where you only want to search a certain type of file, like only Java files? Then you can do:
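For example (a sketch; --java is ack's built-in file-type filter for Java sources):

    ack --java 'pattern'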
ack does not index the source code, but that may not matter, depending on what your search patterns are like. In many cases, searching only certain types of files gives the speedup you need, because you're not also searching all those other XML etc. files.
And if ack doesn't do it for you, here is a list of many tools designed for searching source code: http://beyondgrep.com/more-tools/
We use a tool internally to index very large log files and make efficient searches of them. It has been open-sourced. I don't know how well it scales to large numbers of files, though. It multithreads by default, it searches inside gzipped files, and it caches indexes of previously searched files.
https://github.com/purestorage/4grep
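A hypothetical invocation, assuming a grep-like command line (check the repository for actual usage; the pattern and paths are placeholders):

    4grep 'pattern' /var/log/app/*.gz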
This grep-cache article has a script for caching grep results. Its examples were run on Windows with Linux tools installed, so it can easily be used on *nix/Mac with little modification. It's mostly just a Perl script anyway.
Also, the filesystem itself (assuming you're using *nix) often caches recently read data, causing future grep runs to be faster, since grep is effectively searching virtual memory instead of disk.
If you want to manually drop that cache (to see the speed difference between an uncached and a cached grep), you can write to /proc/sys/vm/drop_caches.
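The usual incantation (requires root; sync first so dirty pages are written back, and 3 drops the page cache plus dentries and inodes):

    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches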
Since you mention various kinds of text files that are not really code, I suggest you have a look at GNU ID utils. For example:
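A sketch (idutils ships mkid to build the ID database, and lid/gid to query it; 'some_token' is a placeholder):

    cd /path/to/dir && mkid    # builds a database file named ID
    lid some_token             # list the files containing the token
    gid some_token             # grep-like output of the matching lines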
These tools focus on tokens, so queries on strings of tokens are not possible. There is minimal integration in emacs for the gid command.
For the more specific case of indexing source code, I prefer to use GNU global, which I find more flexible. For example:
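A sketch (gtags builds the tag files, global queries them; 'main' is a placeholder symbol):

    cd /path/to/dir && gtags   # creates GTAGS, GRTAGS and GPATH
    global -x main             # locate definitions of 'main'
    global -rx main            # locate references to 'main'
    global -u                  # update the tag files incrementally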
Global natively supports C/C++ and Java, and with a bit of configuration, can be extended to support many more languages. It also has very good integration with emacs: successive queries are stacked, and updating a source file updates the index efficiently. However I'm not aware that it is able to index plain text (yet).