How do I find the word count for every git commit in a repository's history?

Posted 2025-01-04 17:09:07

This is about word counts, but I guess it's also about running any program across all git commits in a repository. I am doing a writing project, and realized late that I wanted to generate the word count programmatically after each commit. Only for tex files. But then, how to get the counts for the life of the project? I could not find a simple way to do it, so that is what I am asking.

My solution was to automate the manual process of checking out a branch for each individual commit in the life of the project, and running my little shell/sed/perl scripts to get the date and the word count:

#!/usr/bin/env perl

use strict;
use warnings;
use 5.014;
use App::gh::Git;
use IPC::System::Simple qw(capture);

my $repo = Git->repository( Directory => '/home/amiri/MyProject/.git' );
my @commits
    = reverse $repo->command( 'rev-list', '--all', '--date', 'short' );

my $command
    = qq{find /home/amiri/MyProject -name "*.tex" | xargs wc -w | grep total | sed 's/[a-zA-Z[:space:]]//g'};

my $command2
    = q{git log | grep "Date:" | sed -n 1p | perl -pi -e "s/^Date:\s+//g" | perl -pi -e "s/2011 -\d+$/UTC 2011/g"};

for my $commit (@commits) {
    $repo->command( "checkout", "-b", "$commit", "$commit" );
    my $count = capture($command);
    my $date  = capture($command2);
    chomp $date;
    say "$date,$count";
    $repo->command( "checkout", "master" );
    $repo->command( 'branch', "-d", $commit );
}

So, this works, but I can't help but feel there's a better way to do it? It seems a little icky.

梦情居士 2025-01-11 17:09:08

Extending @cascabel's answer, I recently needed to do a word count on MS Word documents in an archive. Took me a while to figure out, so here it is, extendible to any other type of binary and count...

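# Scratch directory that holds each commit's extracted .docx files
tmpdir=$(mktemp -d)
for commit in $(git rev-list --all); do
    git log -n 1 --pretty=%ad "$commit"
    rm -rf "$tmpdir"/*
    # GNU tar needs --wildcards to match a pattern when extracting
    git archive "$commit" | tar -x -f - -C "$tmpdir" --wildcards '*.docx' 2>/dev/null
    # Convert every document to plain text and count the words in all of them
    find "$tmpdir" -name '*.docx' -exec pandoc -t plain {} \; | wc -w
done
rm -rf "$tmpdir"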

醉酒的小男人 2025-01-11 17:09:07

If you wanted something easier to implement, and don't mind being a little suboptimal and kludgy, you could do this:

for commit in $(git rev-list --all); do
    git log -n 1 --pretty=%ad "$commit"
    # Stream the commit's tree as a tar archive and print every file to stdout
    git archive "$commit" | tar -x -O -f - | wc -w
done

This is way shorter than what you have, and I suspect it might also be faster, because it avoids having to check out files to disk just to read them again to count words. (To restrict it only to certain files, you can pass them as additional arguments to git archive, and note that you can get a list of all files in a given commit with git ls-tree -r --name-only <commit>.)
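As a rough sketch of the restriction mentioned above, the file list from git ls-tree can be filtered by extension and each matching file read straight from the object store with git show. The function name count_tex_words is made up for illustration, and the sketch assumes it is run from the top level of the repository with file names that contain no newlines or characters git would quote.

```shell
# Print the number of words in all *.tex files of one commit.
# count_tex_words is a hypothetical helper name, not part of git.
count_tex_words() {
    git ls-tree -r --name-only "$1" | grep '\.tex$' |
    while IFS= read -r path; do
        # <commit>:<path> names the file's contents as of that commit
        git show "$1:$path"
    done | wc -w
}
```

Inside the loop over git rev-list --all, calling count_tex_words "$commit" would then take the place of the git archive pipeline.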

The git log line just prints the commit date. If you want more, have a look at man git-log for a description of the things you can do - essentially there are tons of placeholders like %ad for author date, %s for commit subject, and so on. The next line does the work. git archive is designed for packing up a given tree into a tar/zip for distribution; we just immediately untar it and count the words. (Obviously you can tweak the output format, and substitute in your own counting mechanism for wc -w if desired.)
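To illustrate the placeholders, here is a small sketch that prints the author date and the subject as one comma-separated line. The name commit_label is hypothetical; --date=short and the tformat: prefix are standard git log options, chosen here so that the date comes out as YYYY-MM-DD and the line always ends with a newline.

```shell
# Print "author-date,subject" for a single commit.
# commit_label is a hypothetical helper name, not part of git.
commit_label() {
    git log -n 1 --date=short --pretty='tformat:%ad,%s' "$1"
}
```

Swapping %s for another placeholder, or dropping it, changes the columns without touching the counting step.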

This is already pretty fast - on a several-year-old laptop it took about a quarter second per commit in a repo with a 20MB work tree.

Of course, if you really really care about performance, probably the absolute fastest method would be to, for each commit, walk the tree, summing word counts over blobs, and storing word counts for each blob so that you don't have to recount them. This is a heck of a lot more work to implement, though. Pseudocode might look like this:

word_counts(range)
    for (commit in `git rev-list <range>`)
        total_count = 0
        # ls-tree prints "<mode> <type> <hash> <path>"; the hash is the third field
        for (blob in third_field_of(`git ls-tree -r <commit>`))
            if (blob not in counts)
                counts[blob] = word_count(`git cat-file blob <blob>`)
            total_count += counts[blob]
        print pretty_format(commit), total_count

pretty_format(commit)
    return `git log -n 1 --pretty=... <commit>`

This avoids any unnecessary intermediate steps, and further optimizes by avoiding having to re-read any unchanged files. In tiny repositories that might not be a big deal, but in larger repos it's a huge deal - imagine a 20MB repo where commits on average touch files of total size 20KB.
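One possible way to turn that idea into a shell script is sketched below. It is only an illustration under a few assumptions: the cache is a directory with one small file per blob hash (a choice made here so it works in plain sh and survives subshells), WORD_CACHE, blob_words and word_counts are made-up names, and every blob in the tree is counted, not just tex files.

```shell
# Cache directory: one file per blob hash, holding that blob's word count.
# WORD_CACHE is a hypothetical variable; preset it to reuse counts across runs.
WORD_CACHE=${WORD_CACHE:-$(mktemp -d)}

# Print the word count of one blob, computing it only the first time it is seen.
blob_words() {
    if [ ! -f "$WORD_CACHE/$1" ]; then
        git cat-file blob "$1" | wc -w | tr -d ' ' > "$WORD_CACHE/$1"
    fi
    cat "$WORD_CACHE/$1"
}

# Print "date,total" for every commit selected by the rev-list arguments,
# oldest first.
word_counts() {
    for commit in $(git rev-list --reverse "$@"); do
        total=$(git ls-tree -r "$commit" | {
            sum=0
            # Each ls-tree line is "<mode> <type> <hash><TAB><path>"
            while read -r mode type hash path; do
                [ "$type" = blob ] || continue   # skip submodule entries
                sum=$((sum + $(blob_words "$hash")))
            done
            echo "$sum"
        })
        echo "$(git log -n 1 --date=short --pretty=format:%ad "$commit"),$total"
    done
}
```

Calling word_counts --all walks the whole history. To count only tex files, a test on $path inside the loop (for example a case statement matching *.tex) would be one way to do it.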
