Storing large lookup tables
I am developing an app that utilizes very large lookup tables to speed up mathematical computations. The largest of these tables is an int[] that has ~10 million entries. Not all of the lookup tables are int[]. For example, one is a Dictionary with ~200,000 entries. Currently, I generate each lookup table once (which takes several minutes) and serialize it to disk (with compression) using the following snippet:
int[] lut = GenerateLUT();
lut.Serialize("lut");
where Serialize is defined as follows:
// Requires System.IO, System.IO.Compression and System.Runtime.Serialization.Formatters.Binary.
public static void Serialize(this object obj, string file)
{
    using (FileStream stream = File.Open(file, FileMode.Create))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Compress))
        {
            var formatter = new BinaryFormatter();
            formatter.Serialize(gz, obj);
        }
    }
}
The annoyance I am having is that when launching the application, deserialization of these lookup tables takes a very long time (upwards of 15 seconds). This kind of delay will annoy users, as the app is unusable until all the lookup tables are loaded. Currently the deserialization looks like this:
Dictionary<string, int> lut1 = (Dictionary<string, int>) Deserialize("lut1");
int[] lut2 = (int[]) Deserialize("lut2");
...
where Deserialize is defined as:
public static object Deserialize(string file)
{
    using (FileStream stream = File.Open(file, FileMode.Open))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Decompress))
        {
            var formatter = new BinaryFormatter();
            return formatter.Deserialize(gz);
        }
    }
}
At first, I thought it might have been the gzip compression that was causing the slowdown, but removing it only shaved a few hundred milliseconds off the serialization/deserialization routines.
Can anyone suggest a way of speeding up the load times of these lookup tables upon the app's initial startup?
5 Answers
First, deserializing in a background thread will prevent the app from "hanging" while this happens. That alone may be enough to take care of your problem.
However, serialization and deserialization (especially of large dictionaries) tend to be very slow in general. Depending on the data structure, writing your own serialization code can speed this up dramatically, particularly if there are no shared references in the data structures.
That being said, depending on the usage pattern, a database might be a better approach. You could always make something more database-oriented and build the lookup table lazily from the DB (i.e. a lookup first checks the LUT, and if the entry isn't there, loads it from the DB and saves it in the table). This would make startup instantaneous (at least in terms of the LUTs) and probably still keep lookups fairly snappy.
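As a rough sketch of the "write your own serialization code" suggestion: for an int[] with no shared references, the data can simply be written as a length prefix followed by the raw ints with BinaryWriter and read back with BinaryReader. The SaveIntArray/LoadIntArray names and the length-prefix layout are illustrative only, not from the original post.

using System.IO;

public static class RawLutIO
{
    // Length prefix followed by the raw ints; no BinaryFormatter metadata at all.
    public static void SaveIntArray(int[] data, string file)
    {
        using (var stream = File.Open(file, FileMode.Create))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(data.Length);
            foreach (int value in data)
                writer.Write(value);
        }
    }

    public static int[] LoadIntArray(string file)
    {
        using (var stream = File.Open(file, FileMode.Open))
        using (var reader = new BinaryReader(stream))
        {
            int count = reader.ReadInt32();
            var data = new int[count];
            for (int i = 0; i < count; i++)
                data[i] = reader.ReadInt32();
            return data;
        }
    }
}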
I guess the obvious suggestion is to load them in the background. Once the app has started, the user has opened their project, and selected whatever operation they want, there won't be much of that 15 seconds left to wait.
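A minimal sketch of that background-loading idea, assuming .NET 4+; the StartupLoader name is made up for this example, and the delegate passed in would just be the same Deserialize call the question already uses.

using System;
using System.Threading.Tasks;

public static class StartupLoader
{
    private static Task<int[]> lut2Task;

    // Call as early as possible (e.g. from Main or the app's startup event),
    // passing the same deserialization call the question already performs.
    public static void BeginLoading(Func<int[]> loadLut2)
    {
        lut2Task = Task.Run(loadLut2);
    }

    // The first computation that needs the table blocks only for whatever part
    // of the load is still outstanding.
    public static int[] Lut2
    {
        get { return lut2Task.Result; }
    }
}

At startup this might be invoked as StartupLoader.BeginLoading(() => (int[])Deserialize("lut2")), and the user can open their project and pick an operation while the load runs.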
Just how much data are we talking about here? In my experience, it takes about 20 seconds to read a gigabyte from disk into memory. So if you're reading upwards of half a gigabyte, you're almost certainly running into hardware limitations.
If data transfer rate isn't the problem, then the actual deserialization is what's taking the time. If you have enough memory, you can load all of the tables into memory buffers (using File.ReadAllBytes()) and then deserialize from a memory stream. That will let you determine how much time the reading takes and how much time the deserialization takes.
If deserialization is taking a lot of time and you have multiple processors, you could spawn multiple threads to do the deserialization in parallel. With such a system, you could potentially be deserializing one or more tables while loading the data for another. That pipelined approach could make your entire load/deserialization time almost as fast as the load alone.
Another option is to put your tables into, well, tables: real database tables. Even an engine like Access should yield pretty good performance, because you have an obvious index for every query. Now the app only has to read in data when it's actually about to use it, and even then it's going to know exactly where to look inside the file.
This might make the app's actual performance a bit lower, because you have to do a disk read for every calculation. But it would make the app's perceived performance much better, because there's never a long wait. And, like it or not, the perception is probably more important than the reality.
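One sketch of what such a database-backed table might look like on the application side, combining it with the lazy fill-in idea from the first answer; the queryDatabase delegate stands in for whatever indexed SELECT is actually used, and the LazyLookup name is invented for this example.

using System;
using System.Collections.Generic;

public class LazyLookup
{
    private readonly Dictionary<string, int> cache = new Dictionary<string, int>();
    private readonly Func<string, int> queryDatabase;   // e.g. an indexed SELECT against the table

    public LazyLookup(Func<string, int> queryDatabase)
    {
        this.queryDatabase = queryDatabase;
    }

    public int Lookup(string key)
    {
        int value;
        if (!cache.TryGetValue(key, out value))
        {
            // Cache miss: hit the database once, then remember the answer.
            value = queryDatabase(key);
            cache[key] = value;
        }
        return value;
    }
}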
Why zip them?
Disk is bigger than RAM.
A straight binary read should be pretty quick.
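For instance, if the big int[] is stored uncompressed as nothing but raw ints (an assumed file layout, not the question's current format), loading it can be a single read plus one block copy:

using System;
using System.IO;

public static class PlainBinary
{
    // Stores the array as raw bytes: no compression, no BinaryFormatter metadata.
    public static void Save(int[] data, string file)
    {
        var bytes = new byte[data.Length * sizeof(int)];
        Buffer.BlockCopy(data, 0, bytes, 0, bytes.Length);
        File.WriteAllBytes(file, bytes);
    }

    // Reads the whole file and reinterprets it as ints in a single copy.
    public static int[] Load(string file)
    {
        byte[] bytes = File.ReadAllBytes(file);
        var data = new int[bytes.Length / sizeof(int)];
        Buffer.BlockCopy(bytes, 0, data, 0, bytes.Length);
        return data;
    }
}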