Best way to handle a large number of integers
I have an array of about 10-100k ints that I need to store (as compressed as possible) and read back into the complete array as fast as possible. What is the best way to handle this type of thing in a language like C#?
Comments (4)
That depends on what you mean by "as compressed as possible".
You can use a BinaryWriter to write the integers to a stream, or use BitConverter.GetBytes to get each int as four bytes and copy them into one large array. Either approach stores each int without any extra metadata.
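As a rough illustration of the BinaryWriter route, here is a minimal sketch. The class name RawIntCodec and its Pack/Unpack methods are made up for this example. It writes exactly four bytes per int with no header, so the element count is recovered from the byte length on the way back.

```csharp
using System.IO;

// Hypothetical helper (not from the original answer): fixed-width packing.
public static class RawIntCodec
{
    // Writes each int as four little-endian bytes, with no extra metadata.
    public static byte[] Pack(int[] values)
    {
        using (var stream = new MemoryStream(values.Length * sizeof(int)))
        using (var writer = new BinaryWriter(stream))
        {
            foreach (int value in values)
                writer.Write(value);
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reads the ints back; the count is implied by the buffer size.
    public static int[] Unpack(byte[] data)
    {
        var values = new int[data.Length / sizeof(int)];
        using (var stream = new MemoryStream(data))
        using (var reader = new BinaryReader(stream))
        {
            for (int i = 0; i < values.Length; i++)
                values[i] = reader.ReadInt32();
        }
        return values;
    }
}
```

With this layout, 100,000 ints always take 400,000 bytes, which gives a baseline for comparing the more compact encodings against.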
If you want it more compressed than that, BinaryWriter has a Write7BitEncodedInt method that writes ints with small values in fewer bytes. You can also use the GZipStream class to try to compress the data further once you have it packed in a byte array.
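The two ideas can be combined, roughly as in the sketch below. It assumes .NET 5 or later, where Write7BitEncodedInt and Read7BitEncodedInt are public (in .NET Framework they are protected, so you would need a subclass or a hand-written equivalent). The class name CompactIntCodec is made up, and the leading count is an addition of this sketch: once the values are variable-length, the reader needs some way to know how many to expect.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper (not from the original answer): varint + gzip.
public static class CompactIntCodec
{
    public static byte[] Compress(int[] values)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
            using (var writer = new BinaryWriter(gzip))
            {
                // Count prefix is an assumption of this sketch.
                writer.Write7BitEncodedInt(values.Length);
                foreach (int value in values)
                    writer.Write7BitEncodedInt(value);
            }
            // The gzip stream must be disposed before the bytes are complete.
            return output.ToArray();
        }
    }

    public static int[] Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new BinaryReader(gzip))
        {
            int count = reader.Read7BitEncodedInt();
            var values = new int[count];
            for (int i = 0; i < count; i++)
                values[i] = reader.Read7BitEncodedInt();
            return values;
        }
    }
}
```

Note that the 7-bit encoding only helps for small non-negative values. A negative int takes five bytes this way, so data with many negatives may be better off with the fixed four-byte layout fed into GZipStream.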
Generally, the smaller you want it, the longer it will take to process. To get the balance between speed and size that you want, you just have to do some testing.
Depending on the nature of the values in this int array, run-length encoding might be another option. That is, if contiguous cells in your array all have the same value, you only need to store the first occurrence of the value in that sequence, along with the number of times it will be repeated after that. This might work especially well with "sparse" data.
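A minimal sketch of run-length encoding could look like the following. The RunLength class is made up for illustration, and it stores each run as a (value, total run length) pair rather than "value plus number of further repeats", which is the same idea with a slightly different count. It uses C# 7 tuple syntax.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not from the original answer): run-length encoding.
public static class RunLength
{
    // Collapses each stretch of equal adjacent values into (Value, Count).
    public static List<(int Value, int Count)> Encode(int[] values)
    {
        var runs = new List<(int Value, int Count)>();
        int i = 0;
        while (i < values.Length)
        {
            int start = i;
            while (i < values.Length && values[i] == values[start])
                i++;
            runs.Add((values[start], i - start));
        }
        return runs;
    }

    // Expands the runs back into the full array.
    public static int[] Decode(List<(int Value, int Count)> runs)
    {
        int total = 0;
        foreach (var run in runs)
            total += run.Count;

        var values = new int[total];
        int pos = 0;
        foreach (var run in runs)
            for (int k = 0; k < run.Count; k++)
                values[pos++] = run.Value;
        return values;
    }
}
```

For example, the array { 0, 0, 0, 5, 5, 7 } encodes to the runs (0, 3), (5, 2), (7, 1). If most values differ from their neighbours, the runs take more space than the original array, so this only pays off when the data really does repeat.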
100,000 ints is not that big (about 400 KB at four bytes each). Why do you need to compress it so much?
Answer for your specific question
Solve the problem in the most straightforward way. If you want on-disk compression, run the data through a zipping library. Keeping the data compressed in memory while you are trying to use it is generally a no-no (the general solution uses other techniques). Please indicate if you want more information on why it is a no-no.
General answer for computing with large data-sets
Specialized mathematics libraries deal with these issues (e.g., Octave or MATLAB), in particular the problem of working with far more numbers than you might think your computer can handle.
These libraries have an execution engine and a specific language, but you can often programmatically interface with them.