grep, but indexable?
I have over 200 MB of source code files that I have to constantly look up (I am part of a very big team). I notice that grep does not create an index, so lookup requires going through the entire source code database each time.
Is there a command line utility similar to grep which has indexing ability?
7 Answers
The solutions below are rather simple. There are a lot of corner cases that they do not cover.
The good part about the solutions is that they are very easy to implement.
Solution 1: one big file
Fact: Seeking is dead slow; reading one big file is often faster.
Given those facts the idea is to simply make an index containing all the files with all their content - each line prepended with the filename and the line number:
Index a dir:
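A minimal sketch, assuming GNU grep ('/path/to/dir' and, later, the pattern 'foobar' are placeholders). The pattern '^' matches every line, so grep emits every line of every file prefixed with its filename and line number:

    grep -rHn '^' /path/to/dir > /path/to/dir.index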
Use the index:
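Lookups then read the single index file sequentially:

    grep foobar /path/to/dir.index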
Solution 2: one big compressed file
Fact: Hard drives are slow. Seeking is dead slow. Multi-core CPUs are normal.
So it may be faster to read a compressed file and decompress it on the fly than reading the uncompressed file - especially if you have RAM enough to cache the compressed file but not enough for the uncompressed file.
Index a dir:
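The same idea, piped through a parallel compressor (a sketch, using the pbzip2/pbzcat tools mentioned below):

    grep -rHn '^' /path/to/dir | pbzip2 -c > /path/to/dir.index.bz2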
Use the index:
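Decompress on the fly and grep the stream:

    pbzcat /path/to/dir.index.bz2 | grep foobar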
Solution 3: use index for finding potential candidates
Generating the index can be time consuming and you might not want to do that for every single change in the dir.
To speed that up only use the index for identifying filenames that might match and do an actual grep through those (hopefully limited number of) files. This will discover files that no longer match, but it will not discover new files that do match.
The sort -u is needed to avoid grepping the same file multiple times.
Index a dir:
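The index is built exactly as in solution 2 (shown again as a sketch):

    grep -rHn '^' /path/to/dir | pbzip2 -c > /path/to/dir.index.bz2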
Use the index:
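A sketch: grep the index, keep only the filename field, deduplicate it, then grep the real files (this assumes filenames contain no colons or whitespace):

    pbzcat /path/to/dir.index.bz2 | grep foobar |
      cut -d: -f1 | sort -u | xargs grep foobar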
Solution 4: append to the index
Re-creating the full index can be very slow. If most of the dir stays the same, you can simply append to the index with newly changed files. The index will again only be used for locating potential candidates, so if a file no longer matches it will be discovered when grepping through the actual file.
Index a dir:
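A sketch; a stamp file remembers when the index was last built:

    grep -rHn '^' /path/to/dir | pbzip2 -c > .index.bz2
    touch .index.stamp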
Append to the index:
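bzip2 streams can be concatenated and pbzcat reads them all, so files changed since the stamp can simply be compressed and appended (a sketch):

    find /path/to/dir -type f -newer .index.stamp -print0 |
      xargs -0 grep -Hn '^' | pbzip2 -c >> .index.bz2
    touch .index.stamp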
Use the index:
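Same as in solution 3; files that no longer match are weeded out by the final grep over the real files:

    pbzcat .index.bz2 | grep foobar |
      cut -d: -f1 | sort -u | xargs grep foobar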
It can be even faster if you use pzstd instead of pbzip2/pbzcat.
Solution 5: use git
git grep can grep through a git repository. But it seems to do a lot of seeks and is 4 times slower on my system than solution 4. The good part is that the .git index is smaller than the .index.bz2.
Index a dir:
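A sketch; the repository itself serves as the (compressed) index:

    cd /path/to/dir
    git init
    git add .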
Append to the index:
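Re-adding the tree picks up changed files:

    git add .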
Use the index:
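A sketch (add --cached to search the staged copies instead of the working tree):

    git grep foobar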
Solution 6: optimize git
Git puts its data into many small files. This results in seeking. But you can ask git to compress the small files into few, bigger files:
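A sketch using git's garbage collector, which repacks loose objects into pack files:

    git gc --aggressive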
This takes a while, but it packs the index very efficiently into a few files.
Now you can do:
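A sketch: pre-reading the pack files pulls them into the page cache before grepping:

    cat .git/objects/pack/* > /dev/null
    git grep foobar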
git will do a lot of seeking into the index, but by running cat first, you put the whole index into RAM.
Adding to the index is the same as in solution 5, but run git gc now and then to avoid many small files, and git gc --aggressive to save more disk space, when the system is idle.
git will not free disk space if you remove files. So if you remove large amounts of data, remove .git and do git init; git add . again.
There is the https://code.google.com/p/codesearch/ project, which is capable of creating an index and searching it quickly. Regexps are supported and computed using the index (actually, only a subset of regexps can use the index to filter the file set; the real regexp is then re-evaluated on the matched files).
The codesearch index is usually 10-20% of the source code size, building it is about as fast as running classic grep 2 or 3 times, and searching is almost instantaneous.
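A usage sketch (the project provides cindex to build the index and csearch to query it; the index lives in ~/.csearchindex by default):

    cindex /path/to/src      # build or update the index
    csearch -n 'myRegex'     # search; -n prints line numbers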
The ideas used in the codesearch project come from Google's Code Search site (RIP). E.g. the index contains a map from n-grams (trigrams, i.e. every 3-byte sequence found in your sources) to the files; when searching, the regexp is translated into a trigram query.
PS And there are ctags and cscope to navigate in C/C++ sources. Ctags can find declarations/definitions, cscope is more capable, but has problems with C++.
PPS And there are also clang-based tools for C/C++/ObjC languages: http://blog.wuwon.id.au/2011/10/vim-plugin-for-navigating-c-with.html and clang-complete
Without addressing the indexing ability part, git grep will, with Git 2.8 (Q1 2016), have the ability to run in parallel!
See commit 89f09dd, commit 044b1f3, commit b6b468b (15 Dec 2015) by Victor Leschuk (vleschuk).
(Merged by Junio C Hamano -- gitster -- in commit bdd1cc2, 12 Jan 2016)
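For illustration, that series added a --threads option and a grep.threads configuration variable (the thread count here is arbitrary):

    git grep --threads=8 foobar
    git config grep.threads 8    # or set it once per repository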
ack is a code searching tool that is optimized for programmers, especially programmers dealing with large heterogeneous source code trees: http://beyondgrep.com/
Are some of your searches ones where you only want to search a certain type of file, like only Java files? Then you can do:
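For example (a sketch; --java is ack's built-in file-type filter for Java sources):

    ack --java 'pattern'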
ack does not index the source code, but that may not matter, depending on what your search patterns are like. In many cases, searching only certain types of files gives the speedup you need, because you're not also searching all those other XML etc. files.
And if ack doesn't do it for you, here is a list of many tools designed for searching source code: http://beyondgrep.com/more-tools/
We use a tool internally to index very large log files and make efficient searches of them. It has been open-sourced. I don't know how well it scales to large numbers of files, though. It multithreads by default, it searches inside gzipped files, and it caches indexes of previously searched files.
https://github.com/purestorage/4grep
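A hypothetical invocation, assuming a grep-like command line (check the repository for actual usage; the pattern and paths are placeholders):

    4grep 'pattern' /var/log/app/*.gz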
This grep-cache article has a script for caching grep results. Its examples were run on Windows with Linux tools installed, so it can easily be used on *nix/Mac with little modification. It's mostly just a Perl script anyway.
Also, the filesystem itself (assuming you're using *nix) often caches recently read data, causing future grep runs to be faster, since grep is effectively searching virtual memory instead of disk.
If you want to manually drop that cache (to see the speed difference between an uncached and a cached grep), you can write to /proc/sys/vm/drop_caches.
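The usual incantation (requires root; sync first so dirty pages are written back, and 3 drops the page cache plus dentries and inodes):

    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches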
Since you mention various kinds of text files that are not really code, I suggest you have a look at GNU ID utils. For example:
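A sketch (idutils ships mkid to build the ID database, and lid/gid to query it; 'some_token' is a placeholder):

    cd /path/to/dir && mkid    # builds a database file named ID
    lid some_token             # list the files containing the token
    gid some_token             # grep-like output of the matching lines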
These tools focus on tokens, so queries on strings of tokens are not possible. There is minimal integration in emacs for the gid command.
For the more specific case of indexing source code, I prefer to use GNU global, which I find more flexible. For example:
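A sketch (gtags builds the tag files, global queries them; 'main' is a placeholder symbol):

    cd /path/to/dir && gtags   # creates GTAGS, GRTAGS and GPATH
    global -x main             # locate definitions of 'main'
    global -rx main            # locate references to 'main'
    global -u                  # update the tag files incrementally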
Global natively supports C/C++ and Java, and with a bit of configuration, can be extended to support many more languages. It also has very good integration with emacs: successive queries are stacked, and updating a source file updates the index efficiently. However I'm not aware that it is able to index plain text (yet).