Most efficient in-memory data structure for read-only dictionary access
In C#, I have some static data that could be put in a Dictionary<int, T> where T is some reference type. The web app only needs to initialize it once, statically (it doesn't change).
Since I don't have to worry about insert or delete performance, what is the best data structure to use (or should I roll my own)? I'm probably looking at something like ~100,000 entries, fairly evenly spaced.
I am looking for an optimal algorithm for fetching this data. Dictionary<> isn't bad, but I would imagine there must be something out there optimized for read-only data.
I suspect, but haven't confirmed, that the range of these keys might be 0 - 400,000. If that were the case, how would the recommendations change? (I have a thought that I will post as a possible answer.)
Maybe I could (rough sketch below):
- Scan through the data once and grab the highest key
- Allocate an array with the size of the highest key + 1.
- Take a second pass and store the data in the array.
Would this be better or worse than a HashTable / Dictionary with a reasonable load factor?
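A minimal sketch of that two-pass idea, assuming the data arrives as a sequence of key/value pairs; the StaticLookup wrapper and its Initialize/Get names are made up for illustration:

    using System;
    using System.Collections.Generic;

    static class StaticLookup<T> where T : class
    {
        // Sparse array indexed directly by the key; a null slot means "no entry".
        private static T[] _byKey;

        // Build once at startup from whatever produces the static data.
        public static void Initialize(IEnumerable<KeyValuePair<int, T>> source)
        {
            var items = new List<KeyValuePair<int, T>>(source);

            // Pass 1: find the highest key.
            int maxKey = 0;
            foreach (var kv in items)
                if (kv.Key > maxKey) maxKey = kv.Key;

            // Pass 2: allocate maxKey + 1 slots and fill them.
            var array = new T[maxKey + 1];
            foreach (var kv in items)
                array[kv.Key] = kv.Value;

            _byKey = array;
        }

        // O(1) read: one bounds check plus one array index.
        public static T Get(int key)
        {
            var a = _byKey;
            return (uint)key < (uint)a.Length ? a[key] : null;
        }
    }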
Comments (4)
A Dictionary is the right way to go. As MSDN puts it, retrieving a value by using its key is very fast, close to O(1), because the Dictionary<TKey,TValue> class is implemented as a hash table.
So it will take some time up front to build the dictionary (computing the hashes and filling the buckets), but reading your data by key will be blazingly fast.
Edit
If more than 50% of the keys in the 0-400k range are present, it makes sense to go with a simple array where the key is the item's index. That gives you O(1) lookups.
According to your question, only about 25% of the keys would be present, so I would go with Dictionary<,> in this case; I don't think it carries 75% memory overhead per key-value pair compared to a simple array.
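A hedged sketch of that decision, assuming the 50% cutoff above; the LookupBuilder name and the delegate-returning shape are illustrative, not a recommendation:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class LookupBuilder
    {
        // Returns a read-only lookup backed either by a plain array (dense keys)
        // or by an ordinary Dictionary (sparse keys).
        public static Func<int, T> Build<T>(
            IReadOnlyCollection<KeyValuePair<int, T>> data,
            double densityThreshold = 0.5) where T : class
        {
            int maxKey = data.Max(kv => kv.Key);
            double density = (double)data.Count / (maxKey + 1);

            if (density >= densityThreshold)
            {
                // Dense enough: the key doubles as the array index.
                var array = new T[maxKey + 1];
                foreach (var kv in data) array[kv.Key] = kv.Value;
                return key => (uint)key < (uint)array.Length ? array[key] : null;
            }

            // Sparse: fall back to a Dictionary.
            var dict = data.ToDictionary(kv => kv.Key, kv => kv.Value);
            return key => dict.TryGetValue(key, out var value) ? value : null;
        }
    }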
You might want to check out the frozen dictionaries available in .NET 8.0. Reference: https://learn.microsoft.com/en-us/dotnet/api/system.collections.frozen.frozendictionary-2 (the docs describe FrozenDictionary as an immutable, read-only dictionary optimized for fast lookup and enumeration).
It can then be instantiated via an extension method, such as
new KeyValuePair<string, string>[]{ new ("Hello", "World" )}.ToFrozenDictionary();
Also note that the namespace to import is System.Collections.Frozen. There is a benchmark available here: https://code-corner.dev/2023/11/08/NET-8-%E2%80%94-FrozenDictionary-performance/
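For the scenario in the question (int keys, built once at startup and then only read), a minimal sketch might look like this; the StaticData wrapper and its method names are just for illustration:

    using System.Collections.Frozen;
    using System.Collections.Generic;

    static class StaticData<T>
    {
        // Built once, then only read. FrozenDictionary optimizes its internal
        // layout at construction time, trading build cost for faster lookups.
        private static FrozenDictionary<int, T> _lookup;

        public static void Initialize(IEnumerable<KeyValuePair<int, T>> source)
        {
            _lookup = source.ToFrozenDictionary();
        }

        public static bool TryGet(int key, out T value)
        {
            return _lookup.TryGetValue(key, out value);
        }
    }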
If it's really a dictionary, a trie works reasonably well. Dictionary (a hashtable) is another possibility, as long as you fine-tune it. Which would be faster... I don't know; you'd need to profile it, I guess. Space-wise, the trie wins hands down. I don't think .NET has a trie in its standard library, but there should be some implementations floating around.
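Since the standard library has no trie, any version would be hand-rolled. A very small, purely illustrative sketch of a bitwise trie over int keys (real implementations usually compress paths rather than spending one node per bit):

    // Illustrative only: a tiny bitwise trie keyed by the bits of an int.
    sealed class IntTrie<T> where T : class
    {
        private sealed class Node
        {
            public Node Zero, One; // children for the next bit of the key
            public T Value;        // non-null only on terminal nodes
        }

        private readonly Node _root = new Node();

        public void Add(int key, T value)
        {
            var node = _root;
            for (int bit = 0; bit < 32; bit++)
            {
                bool one = ((key >> bit) & 1) != 0;
                var next = one ? node.One : node.Zero;
                if (next == null)
                {
                    next = new Node();
                    if (one) node.One = next; else node.Zero = next;
                }
                node = next;
            }
            node.Value = value;
        }

        public T Find(int key)
        {
            var node = _root;
            for (int bit = 0; bit < 32 && node != null; bit++)
                node = ((key >> bit) & 1) != 0 ? node.One : node.Zero;
            return node?.Value;
        }
    }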
You ask about the "most efficient" structure, but in terms of what? Typically there's a tradeoff between CPU time and memory usage.
So the optimal data structure in terms of read performance is an array, exactly as you've described it. It is consistently faster than a Dictionary, at least for reading a single item, and it also takes less space than a Dictionary, assuming the items are not too sparse.
However, arrays are contiguous pieces of memory, which means every gap between items contributes to memory usage. So given 100k items, you waste the memory reserved for the missing 300k items in the 0-400k range. That being said, I've been running some tests, and as long as at least 20% of the 0-400k range is covered, the array still takes less space. But of course YMMV.
So the next best thing is to use a fixed array, but with a perfect hash function. You allocate an array of size 100k, and then you create a function that maps those input integers onto integers in the range 0-100k without gaps. Such a function always exists if the input set doesn't change, and there are algorithms to construct one at runtime (for example, see the paper "Simple and Space-Efficient Minimal Perfect Hash Functions" by Botelho, Pagh and Ziviani). With the perfect hash function you now store the items compactly, but you pay a runtime price for the hashing. This will be slightly slower than the first variant, but still likely faster than a Dictionary (although I didn't test it), and it now takes the optimal amount of memory.
Then there are nuances. The array solution is indeed the fastest if you want to retrieve a single item. However, if you want to iterate over all the items, the perfect-hash array will be faster, because the items sit next to each other and are therefore more CPU-cache friendly, and you don't have to do null checks. Note that with this solution you would likely need to store the int key together with the value (which in C# is not a big deal; you would just wrap the value in a small struct).
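A true minimal perfect hash function is more than a short snippet (see the paper above for the real construction), but a hedged sketch of the compact layout it enables might look like this; here an ordinary Dictionary<int, int> stands in for the perfect hash, mapping each key to its dense slot:

    using System.Collections.Generic;

    // Illustrative only: entries are packed into one contiguous array of structs,
    // which is what makes iteration cache-friendly. A real implementation would
    // replace _slotOf with a minimal perfect hash function instead of a Dictionary.
    sealed class CompactLookup<T>
    {
        private struct Entry
        {
            public int Key;
            public T Value;
        }

        private readonly Entry[] _entries;             // exactly one slot per item
        private readonly Dictionary<int, int> _slotOf; // key -> index into _entries

        public CompactLookup(IReadOnlyList<KeyValuePair<int, T>> data)
        {
            _entries = new Entry[data.Count];
            _slotOf = new Dictionary<int, int>(data.Count);
            for (int i = 0; i < data.Count; i++)
            {
                _entries[i] = new Entry { Key = data[i].Key, Value = data[i].Value };
                _slotOf[data[i].Key] = i;
            }
        }

        public bool TryGet(int key, out T value)
        {
            if (_slotOf.TryGetValue(key, out int slot))
            {
                value = _entries[slot].Value;
                return true;
            }
            value = default;
            return false;
        }

        // Iteration walks a dense array of structs: no gaps, no null checks.
        public IEnumerable<KeyValuePair<int, T>> All()
        {
            foreach (var e in _entries)
                yield return new KeyValuePair<int, T>(e.Key, e.Value);
        }
    }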