Most efficient in-memory data structure for read-only dictionary access
In C#, I have some static data that could be put in a Dictionary<int, T> where T is some reference type. The web app only needs to initialize it once, statically (it doesn't change).
Since I don't have to worry about insert or delete performance, what is the best data structure to use (or should I roll my own)? I'm probably looking at something like ~100,000 entries, fairly evenly spaced.
I am looking for an optimal algorithm for fetching this data. Dictionary<> isn't bad, but I would imagine there must be something out there optimized for read-only data.
I suspect, but haven't confirmed, that the range of these keys might be 0 - 400,000. If that were the case, how would the recommendations change? (I have a thought that I will post as a possible answer.)
Maybe I could (rough sketch below):
- Scan through the data once and grab the highest key
- Allocate an array with the size of the highest key + 1.
- Take a second pass and store the data in the array.
Would this be better or worse than a HashTable / Dictionary with a reasonable load factor?
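A minimal sketch of that two-pass idea, assuming the data arrives as a sequence of key/value pairs; the StaticLookup wrapper and its Initialize/Get names are made up for illustration:

    using System;
    using System.Collections.Generic;

    static class StaticLookup<T> where T : class
    {
        // Sparse array indexed directly by the key; a null slot means "no entry".
        private static T[] _byKey;

        // Build once at startup from whatever produces the static data.
        public static void Initialize(IEnumerable<KeyValuePair<int, T>> source)
        {
            var items = new List<KeyValuePair<int, T>>(source);

            // Pass 1: find the highest key.
            int maxKey = 0;
            foreach (var kv in items)
                if (kv.Key > maxKey) maxKey = kv.Key;

            // Pass 2: allocate maxKey + 1 slots and fill them.
            var array = new T[maxKey + 1];
            foreach (var kv in items)
                array[kv.Key] = kv.Value;

            _byKey = array;
        }

        // O(1) read: one bounds check plus one array index.
        public static T Get(int key)
        {
            var a = _byKey;
            return (uint)key < (uint)a.Length ? a[key] : null;
        }
    }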
Comments (4)
A Dictionary is the right way to go. As MSDN puts it, retrieving a value by using its key is very fast, close to O(1), because the Dictionary<TKey,TValue> class is implemented as a hash table.
So it will take some time up front to build the dictionary (computing the hashes and filling the buckets), but reading your data by key will be blazingly fast.
Edit
If more than 50% of the keys in the 0-400k range are present, it makes sense to go with a simple array where the key is the item's index. That gives you O(1) lookups.
According to your question, only about 25% of the keys would be present, so I would go with Dictionary<,> in this case; I don't think it carries 75% memory overhead per key-value pair compared to a simple array.
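A hedged sketch of that decision, assuming the 50% cutoff above; the LookupBuilder name and the delegate-returning shape are illustrative, not a recommendation:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class LookupBuilder
    {
        // Returns a read-only lookup backed either by a plain array (dense keys)
        // or by an ordinary Dictionary (sparse keys).
        public static Func<int, T> Build<T>(
            IReadOnlyCollection<KeyValuePair<int, T>> data,
            double densityThreshold = 0.5) where T : class
        {
            int maxKey = data.Max(kv => kv.Key);
            double density = (double)data.Count / (maxKey + 1);

            if (density >= densityThreshold)
            {
                // Dense enough: the key doubles as the array index.
                var array = new T[maxKey + 1];
                foreach (var kv in data) array[kv.Key] = kv.Value;
                return key => (uint)key < (uint)array.Length ? array[key] : null;
            }

            // Sparse: fall back to a Dictionary.
            var dict = data.ToDictionary(kv => kv.Key, kv => kv.Value);
            return key => dict.TryGetValue(key, out var value) ? value : null;
        }
    }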
You might want to check out the frozen dictionaries available in .NET 8.0. Reference: https://learn.microsoft.com/en-us/dotnet/api/system.collections.frozen.frozendictionary-2 (the docs describe FrozenDictionary as an immutable, read-only dictionary optimized for fast lookup and enumeration).
It can then be instantiated via an extension method, such as
new KeyValuePair<string, string>[]{ new ("Hello", "World" )}.ToFrozenDictionary();
Also note that the namespace to import is System.Collections.Frozen. There is a benchmark available here: https://code-corner.dev/2023/11/08/NET-8-%E2%80%94-FrozenDictionary-performance/
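For the scenario in the question (int keys, built once at startup and then only read), a minimal sketch might look like this; the StaticData wrapper and its method names are just for illustration:

    using System.Collections.Frozen;
    using System.Collections.Generic;

    static class StaticData<T>
    {
        // Built once, then only read. FrozenDictionary optimizes its internal
        // layout at construction time, trading build cost for faster lookups.
        private static FrozenDictionary<int, T> _lookup;

        public static void Initialize(IEnumerable<KeyValuePair<int, T>> source)
        {
            _lookup = source.ToFrozenDictionary();
        }

        public static bool TryGet(int key, out T value)
        {
            return _lookup.TryGetValue(key, out value);
        }
    }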
If it's really a dictionary, a trie works reasonably well. Dictionary (a hashtable) is another possibility, as long as you fine-tune it. Which would be faster... I don't know; you'd need to profile it, I guess. Space-wise, the trie wins hands down. I don't think .NET has a trie in its standard library, but there should be some implementations floating around.
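Since the standard library has no trie, any version would be hand-rolled. A very small, purely illustrative sketch of a bitwise trie over int keys (real implementations usually compress paths rather than spending one node per bit):

    // Illustrative only: a tiny bitwise trie keyed by the bits of an int.
    sealed class IntTrie<T> where T : class
    {
        private sealed class Node
        {
            public Node Zero, One; // children for the next bit of the key
            public T Value;        // non-null only on terminal nodes
        }

        private readonly Node _root = new Node();

        public void Add(int key, T value)
        {
            var node = _root;
            for (int bit = 0; bit < 32; bit++)
            {
                bool one = ((key >> bit) & 1) != 0;
                var next = one ? node.One : node.Zero;
                if (next == null)
                {
                    next = new Node();
                    if (one) node.One = next; else node.Zero = next;
                }
                node = next;
            }
            node.Value = value;
        }

        public T Find(int key)
        {
            var node = _root;
            for (int bit = 0; bit < 32 && node != null; bit++)
                node = ((key >> bit) & 1) != 0 ? node.One : node.Zero;
            return node?.Value;
        }
    }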
You ask about the "most efficient" structure, but in terms of what? Typically there's a tradeoff between CPU time and memory usage.
So the optimal data structure in terms of read performance is an array, exactly as you've described it. It is consistently faster than a Dictionary, at least for reading a single item, and it also takes less space than a Dictionary, assuming the items are not too sparse.
However, arrays are contiguous pieces of memory, which means every gap between items contributes to memory usage. So given 100k items, you waste the memory reserved for the missing 300k items in the 0-400k range. That being said, I've been running some tests, and as long as at least 20% of the 0-400k range is covered, the array still takes less space. But of course YMMV.
So the next best thing is to use a fixed array, but with a perfect hash function. You allocate an array of size 100k, and then you create a function that maps those input integers onto integers in the range 0-100k without gaps. Such a function always exists if the input set doesn't change, and there are algorithms to construct one at runtime (for example, see the paper "Simple and Space-Efficient Minimal Perfect Hash Functions" by Botelho, Pagh and Ziviani). With the perfect hash function you now store the items compactly, but you pay a runtime price for the hashing. This will be slightly slower than the first variant, but still likely faster than a Dictionary (although I didn't test it), and it now takes the optimal amount of memory.
Then there are nuances. The array solution is indeed the fastest if you want to retrieve a single item. However, if you want to iterate over all the items, the perfect-hash array will be faster, because the items sit next to each other and are therefore more CPU-cache friendly, and you don't have to do null checks. Note that with this solution you would likely need to store the int key together with the value (which in C# is not a big deal; you would just wrap the value in a small struct).
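A true minimal perfect hash function is more than a short snippet (see the paper above for the real construction), but a hedged sketch of the compact layout it enables might look like this; here an ordinary Dictionary<int, int> stands in for the perfect hash, mapping each key to its dense slot:

    using System.Collections.Generic;

    // Illustrative only: entries are packed into one contiguous array of structs,
    // which is what makes iteration cache-friendly. A real implementation would
    // replace _slotOf with a minimal perfect hash function instead of a Dictionary.
    sealed class CompactLookup<T>
    {
        private struct Entry
        {
            public int Key;
            public T Value;
        }

        private readonly Entry[] _entries;             // exactly one slot per item
        private readonly Dictionary<int, int> _slotOf; // key -> index into _entries

        public CompactLookup(IReadOnlyList<KeyValuePair<int, T>> data)
        {
            _entries = new Entry[data.Count];
            _slotOf = new Dictionary<int, int>(data.Count);
            for (int i = 0; i < data.Count; i++)
            {
                _entries[i] = new Entry { Key = data[i].Key, Value = data[i].Value };
                _slotOf[data[i].Key] = i;
            }
        }

        public bool TryGet(int key, out T value)
        {
            if (_slotOf.TryGetValue(key, out int slot))
            {
                value = _entries[slot].Value;
                return true;
            }
            value = default;
            return false;
        }

        // Iteration walks a dense array of structs: no gaps, no null checks.
        public IEnumerable<KeyValuePair<int, T>> All()
        {
            foreach (var e in _entries)
                yield return new KeyValuePair<int, T>(e.Key, e.Value);
        }
    }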