Fastest way to deserialize objects from a huge binary file
So the scenario is as follows: I have a 2-3 GB file of binary serialized objects, and I also have an index file which contains the id of each object and its offset in the file.
I need to write a method that, given a set of ids, deserializes them into memory. Performance is the most important benchmark, and keeping the memory requirements reasonable comes second.
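For concreteness, loading the index could look roughly like this (the fixed 8-byte id / 8-byte offset record layout here is just for the sake of the example; the exact format isn't important):

```csharp
using System.Collections.Generic;
using System.IO;

static Dictionary<long, long> LoadIndex(string indexPath)
{
    // Assumed record layout: 8-byte id followed by 8-byte file offset.
    var index = new Dictionary<long, long>();
    using var reader = new BinaryReader(File.OpenRead(indexPath));
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        long id = reader.ReadInt64();
        long offset = reader.ReadInt64();
        index[id] = offset;
    }
    return index;
}
```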
Using MemoryMappedFile seems the way to go, however I'm a bit unsure how to handle the large file. I can't create a MemoryMappedViewAccessor for the entire file since it's so large. Can I have several MemoryMappedViewAccessors over different segments open simultaneously without affecting memory too much, and in that case how large should those segments be? Something like the sketch below is what I have in mind.
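(The segment size here is made up; the point is mapping the file once and only creating views over segments of it.)

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

// Map the file once; this mainly reserves address space, not physical RAM.
using var mmf = MemoryMappedFile.CreateFromFile(
    "data.bin", FileMode.Open, null, 0, MemoryMappedFileAccess.Read);

// Create a view over one segment only, e.g. 256 MB starting at some offset.
const long segmentSize = 256L * 1024 * 1024;
long segmentStart = 0; // would be chosen per batch of requested objects
using var view = mmf.CreateViewAccessor(
    segmentStart, segmentSize, MemoryMappedFileAccess.Read);
```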
The views might be kept in memory for a while if the data is accessed frequently, and then disposed of.
A perhaps naive method would be to order the objects to be fetched by offset and simply call CreateViewAccessor for each offset with a small buffer, roughly as in the sketch below. Another would be to try to figure out the least number of distinct MemoryMappedViewAccessors needed and their sizes, but I'm unsure of the overhead of calling CreateViewAccessor and how much space you can safely access in one go. I can do some testing, but if someone has a better idea... :)
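Here is roughly what I mean by the naive variant; the 4-byte length prefix, MyObject, and DeserializeObject are stand-ins for my actual serialization format:

```csharp
using System.Collections.Generic;
using System.IO.MemoryMappedFiles;
using System.Linq;

static List<MyObject> FetchObjects(
    MemoryMappedFile mmf, Dictionary<long, long> index, IEnumerable<long> ids)
{
    var results = new List<MyObject>();
    // Sort by file offset so the reads sweep the file in one direction.
    foreach (long offset in ids.Select(id => index[id]).OrderBy(o => o))
    {
        // Assumption: each object is stored as a 4-byte length prefix
        // followed by the serialized payload.
        using var header = mmf.CreateViewAccessor(offset, 4, MemoryMappedFileAccess.Read);
        int length = header.ReadInt32(0);

        var buffer = new byte[length];
        using var body = mmf.CreateViewAccessor(offset + 4, length, MemoryMappedFileAccess.Read);
        body.ReadArray(0, buffer, 0, length);

        results.Add(DeserializeObject(buffer)); // hypothetical deserializer
    }
    return results;
}
```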
I guess another way to go would be to split the large data file into several smaller ones, but I'm not sure that would do any good in this case...
Answers (2)
What kind of storage is the file on, a normal HDD or an SSD? In the case of a normal HDD you should minimize seek times, so you might need to order your accesses by offset.
I don't think having large memory-mapped segments costs much RAM. They mostly consume address space, since they can be backed by the file itself, so most of the RAM actually used is the OS page cache.
From what I've heard, async IO using I/O Completion Ports is the fastest, but I haven't used them myself yet.
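In .NET you get completion-port-backed reads by opening the FileStream with FileOptions.Asynchronous; an untested sketch of what a single positioned read might look like:

```csharp
using System.IO;
using System.Threading.Tasks;

// FileOptions.Asynchronous makes FileStream use overlapped I/O, which is
// serviced by I/O completion ports on Windows.
static async Task<byte[]> ReadAtAsync(string path, long offset, int length)
{
    using var fs = new FileStream(
        path, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 1, FileOptions.Asynchronous | FileOptions.RandomAccess);
    fs.Seek(offset, SeekOrigin.Begin);

    var buffer = new byte[length];
    int read = 0;
    while (read < length) // ReadAsync may return fewer bytes than requested
        read += await fs.ReadAsync(buffer, read, length - read);
    return buffer;
}
```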
My question to you is: why do you have 2-3 GB files of serialized objects in the first place? Loading that up is always going to be a performance issue.
Do you really need to handle all this information at once? The best approach might be some kind of database that you could query for just the elements you need, when you need them, and rebuild them at that point.
Can you provide more information on what kind of data you are storing and how you are using it? It seems to me that your design needs a little work.