SHA-1-based directory structure and NTFS limitations?

Posted 2024-08-14 10:58:41

I have an app that is storing file-based data under an NTFS directory path that keys off the SHA-1 hash of the data. It has several really nice attributes (de-duplication, imperviousness to other metadata changes, etc.), but I'm curious about the best practices people have experienced for creating hash-based directory storage structures. My primary concern is the number of files/folders that can realistically be stored at a given folder depth.

Does anyone know what sorts of limitations I'll run into? If I were to dump them all into folders at the root of the storage path, I feel like I would severely limit the storage's ability to grow. While it won't be a problem soon, I'd rather have a structure that avoids this than try to restructure massive storage later.

If I took the approach of chunking up the signature to create a deeper tree, is there any guidance on how much I would need to chunk it? Would something like this suffice?

StringBuilder foo = new StringBuilder(60);
// ...root, etc.
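// NOTE: sha1 is assumed here to be the 40-character hex string of the hash;
// Path and StringBuilder require using System.IO; and using System.Text;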
// SHA-1 always has a length of 40, chunk it up to distribute into smaller groups
// "\0000\0000000000000000\00000000000000000000"
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 0, 4);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 4, 16);
foo.Append(Path.DirectorySeparatorChar);
foo.Append(sha1, 20, 20);

Knowing that SHA-1 has a pretty decent distribution, I would have to assume that eventually there would be large clusters but that on average it would be evenly distributed. It is those clusters that I'm concerned about.

Are there performance penalties when accessing directory structures that are too wide? I know that Windows Explorer will choke, but what about programmatic access via C# / System.IO?


Comments (3)

溺深海 2024-08-21 10:58:42

Some observations:

  • You split after 4 characters and then after 16 more. Four hex characters alone can lead to 65,536 entries in a directory, and 16 characters can lead to 16^16 entries, which is certainly far too many (and there are still 20 characters remaining...).
  • So the next question is: how did you choose these numbers? They look like magic numbers to me. You seem to be hoping that your splits will do the job in all cases...

Your question about how much directory depth can be handled is a good one, and I can't answer it. But you should check whether 20 nested directories are too many to handle, because splitting after every 2 characters gives 20 levels with at most 256 entries per level:

xx/xx/xx/xx/xx/...

On the other hand, you could stick with your 4 characters, which would lead to a depth of 10 and at most 65,536 entries per level:

xxxx/xxxx/xxxx/xxxx/xxxx/...

However, in both cases I'd probably write a dynamic algorithm that checks the number of items per level and introduces new subfolders as they are needed. That way the first 256 (or 65,536) items would simply go into a single directory.
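A minimal sketch of that dynamic approach, assuming the hash arrives as a 40-character hex string; the class and member names (HashBuckets, GetBucketPath, MaxEntriesPerDirectory), the 256-entry threshold, and the 2-character level width are illustrative choices, not anything from this answer:

using System;
using System.IO;
using System.Linq;

static class HashBuckets
{
    const int MaxEntriesPerDirectory = 256; // illustrative threshold, tune as needed
    const int CharsPerLevel = 2;            // 2 hex chars => up to 256 subfolders per level

    // Descends one 2-character level each time the current directory is already full,
    // so the first 256 items land directly under the root.
    public static string GetBucketPath(string root, string sha1Hex)
    {
        string current = root;
        int offset = 0;

        while (offset + CharsPerLevel < sha1Hex.Length
            && Directory.Exists(current)
            && Directory.EnumerateFileSystemEntries(current).Count() >= MaxEntriesPerDirectory)
        {
            current = Path.Combine(current, sha1Hex.Substring(offset, CharsPerLevel));
            offset += CharsPerLevel;
        }

        Directory.CreateDirectory(current);

        // Whatever remains of the hash becomes the file name at this level.
        return Path.Combine(current, sha1Hex.Substring(offset));
    }
}

Note that counting entries on every store has a cost, and files written before a level filled up stay where they are, so a lookup would have to probe each level from the root down.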

三人与歌 2024-08-21 10:58:42

Add a collision detector and resolver. You had better be ready in case someone tries to check in SHA-1 collision vectors.

I've not seen any SHA-1 collisions yet but I did see a bad case of an accidental MD5 collision where someone thought they were unique.

Anyway, NTFS uses B-tree directory structures, so you really could place everything in one folder. Windows Explorer won't like it, though.
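A minimal sketch of the collision check suggested above, assuming files are stored at a path derived from the hex hash; the CollisionGuard/Store names and the byte-for-byte comparison are illustrative, not part of the original answer:

using System;
using System.IO;
using System.Linq;

static class CollisionGuard
{
    // Writes newContent to targetPath, treating an existing file at the same
    // hash-derived path as a duplicate only if its bytes actually match.
    public static void Store(string targetPath, byte[] newContent)
    {
        if (File.Exists(targetPath))
        {
            byte[] existing = File.ReadAllBytes(targetPath);
            if (!existing.SequenceEqual(newContent))
            {
                // Same hash, different content: a genuine collision that needs resolving.
                throw new InvalidOperationException("Hash collision detected at " + targetPath);
            }
            return; // true duplicate; nothing to write
        }

        File.WriteAllBytes(targetPath, newContent);
    }
}

A cheaper variant could compare file lengths, or a second independent hash, before falling back to reading the whole existing file.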

梦魇绽荼蘼 2024-08-21 10:58:42

Thanks to the other answerers for their insight.

From other questions around the web, it sounds like NTFS can handle these sizes, but Windows Explorer and network operations will potentially choke at much lower thresholds. I ran a simulation of a very even random distribution, similar to what SHA-1 would produce, for a random set of 1,000,000 "files".

Windows Explorer definitely did not like a directory width of 4 hex characters, as it very quickly approached the maximum (65,536) for that level. I tweaked the top two directory lengths to be 3 each (4,096 max) and put the remaining 34 digits in the third level, to try to balance depth against the probability of too many directories per level. This seems to let Windows Explorer handle browsing the structure.

Here's my simulation:

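// Requires: using System; using System.Diagnostics; using System.IO;
//           using System.Security.Cryptography; using System.Text;
// FormatBytes (a hex-encoding helper) is not shown in the original; see the note below the snippet.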
const string Root = @"C:\_Sha1Buckets";
using (TextWriter writer = File.CreateText(@"C:\_Sha1Buckets.txt"))
{
    // simulate a very even distribution like SHA-1 would produce
    RandomNumberGenerator rand = RandomNumberGenerator.Create();
    byte[] sha1 = new byte[20];
    Stopwatch watch = Stopwatch.StartNew();

    for (int i=0; i<1000000; i++)
    {
        // populate bytes with a fake SHA-1
        rand.GetBytes(sha1);

        // format bytes into hex string
        string hash = FormatBytes(sha1);

        // C:\_Sha1Buckets
        StringBuilder builder = new StringBuilder(Root, 60);

        // \012\345\6789abcdef0123456789abcdef01234567\
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 0, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 3, 3);
        builder.Append(Path.DirectorySeparatorChar);
        builder.Append(hash, 6, 34);
        builder.Append(Path.DirectorySeparatorChar);

        Directory.CreateDirectory(builder.ToString());
        if (i % 5000 == 0)
        {
            // write out timings every five thousand files to see if changes
            writer.WriteLine("{0}: {1}", i, watch.Elapsed);
            Console.WriteLine("{0}: {1}", i, watch.Elapsed);
            watch.Reset();
            watch.Start();
        }
    }

    watch.Reset();
    Console.WriteLine("Press any key to delete the directory structure...");
    Console.ReadLine();
    watch.Start();
    Directory.Delete(Root, true);
    writer.WriteLine("Delete took {0}", watch.Elapsed);
    Console.WriteLine("Delete took {0}", watch.Elapsed);
}
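
The snippet calls a FormatBytes helper that isn't shown. A minimal version might look like the following, assuming it simply hex-encodes the 20-byte hash into the 40-character lowercase string the paths above expect; the exact behaviour is inferred rather than taken from the original answer:

static string FormatBytes(byte[] bytes)
{
    // Assumed implementation: lowercase hex-encode each hash byte.
    var hex = new StringBuilder(bytes.Length * 2);
    foreach (byte b in bytes)
    {
        hex.Append(b.ToString("x2"));
    }
    return hex.ToString();
}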

After about fifty thousand iterations, the simulation appears to slow down a bit (15-20 seconds per 5,000) but stays at that rate. The delete at the end took over 30 minutes on my machine!

The distributions work out like this for 1 million hashes:

  • 1st level has 4096 folders
  • 2nd level has an average of 250 folders
  • 3rd level has an average of 1 folder
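
(For a uniform distribution these averages follow directly: the first level has 16^3 = 4,096 possible buckets, so 1,000,000 / 4,096 ≈ 244 hashes land in each one, giving roughly 250 second-level folders per bucket; and because 16^6 ≈ 16.8 million far exceeds 1,000,000, nearly every second-level folder holds just a single entry.)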

That is very manageable within Windows Explorer and doesn't seem to get too deep or too wide. Obviously, if the distribution weren't this even, then we could run into problems, but only at the third level; the first two levels are each bounded at 4,096 entries. I suppose if the target set were larger, we could add an additional level and gain a lot of growth potential. For my application, 1 million is a very reasonable upper bound.

Anyone have any thoughts on the validity of such a test for determining directory structure heuristics?
