How should I design a class that wraps and provides access to a single file?
MyClass is all about providing access to a single file. It must CheckHeader(), ReadSomeData(), UpdateHeader(WithInfo), etc.
But since the file that this class represents is very complex, it requires special design considerations.
The file contains a potentially huge folder-like tree structure with various node types, and it is block/cell based to handle fragmentation better. It is usually smaller than 20 MB. The format is not my design.
How would you design such a class?
- Read a ~20 MB stream into memory?
- Put a copy in a temp directory and keep its path as a property?
- Keep a copy of the big things in memory and expose them as read-only properties?
- GetThings() from the file with exception-throwing code?
The class (or classes) will be used only by me at first, but if it turns out good enough I might open-source it.
(This is a design question, but the platform is .NET and the class is about offline registry access for XP.)
Comments (2)
It depends on what you need to do with this data. If you only need to process it linearly once, it might be faster to just accept the cost of holding the whole file in memory.
If, however, you need to do various things with the file beyond a single linear parse, I would parse the data into a lightweight database such as SQLite and then operate on that. This way all of your file's structure is preserved and all subsequent operations on the file will be faster.
Registry access is quite complex. You are basically reading a large tree stored in a binary format. The class design should rely heavily on the stored data structures; only then can you choose an appropriate class design. To stay flexible you should model the primitives such as REG_SZ, REG_EXPAND_SZ, DWORD, SubKey, and so on. Don Syme's book Expert F# has a nice section about binary parsing with binary combinators. The basic idea is that your objects themselves know how to deserialize from a binary representation. When you have a stream of bytes with a known structure, you start with a BinaryReader and read the binary objects byte by byte. Since you know that the first thing must be the header, you can pass the reader to the Header object.
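As a rough illustration of an object that deserializes itself from a BinaryReader, here is a minimal C# sketch. The Header type, its ReadFrom method, and the field layout (a 4-byte ASCII signature followed by two 32-bit sequence numbers) are assumptions made for the example, not a verified description of the hive format; check a format reference before relying on any offsets.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical header type. The field layout below is an assumption
// chosen for illustration, not a verified hive header layout.
public sealed class Header
{
    public string Signature { get; }
    public uint Sequence1 { get; }
    public uint Sequence2 { get; }

    private Header(string signature, uint sequence1, uint sequence2)
    {
        Signature = signature;
        Sequence1 = sequence1;
        Sequence2 = sequence2;
    }

    // True when both sequence numbers match (assumed consistency rule).
    public bool IsConsistent => Sequence1 == Sequence2;

    // The object knows how to build itself from the reader's current position.
    public static Header ReadFrom(BinaryReader reader)
    {
        byte[] signatureBytes = reader.ReadBytes(4);
        if (signatureBytes.Length != 4)
            throw new InvalidDataException("Stream is too short to contain a header.");

        string signature = Encoding.ASCII.GetString(signatureBytes);
        uint sequence1 = reader.ReadUInt32(); // BinaryReader reads little-endian
        uint sequence2 = reader.ReadUInt32();
        return new Header(signature, sequence1, sequence2);
    }
}
```

A caller would open the file with new BinaryReader(File.OpenRead(path)) and call Header.ReadFrom(reader). Other node types can follow the same pattern, each consuming its own bytes and leaving the reader positioned at the next object.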
To stay performant you can, for example, delay parsing the data until specific properties of an instance are actually accessed.
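One way to sketch this delayed parsing in C# is with Lazy&lt;T&gt;: the object keeps the raw bytes and decodes them only on first access. The LazyStringValue type is hypothetical, and decoding the payload as UTF-16LE text with a trailing NUL is an assumption for the example.

```csharp
using System;
using System.Text;

// Hypothetical value node: stores raw bytes and decodes them on first access.
public sealed class LazyStringValue
{
    private readonly byte[] _raw;
    private readonly Lazy<string> _text;

    public LazyStringValue(byte[] raw)
    {
        _raw = raw ?? throw new ArgumentNullException(nameof(raw));
        _text = new Lazy<string>(Decode); // nothing is parsed yet
    }

    // True once the text has been decoded and cached.
    public bool IsParsed => _text.IsValueCreated;

    // First access triggers Decode; later accesses reuse the cached string.
    public string Text => _text.Value;

    private string Decode()
    {
        // Assumption: the payload is UTF-16LE, possibly NUL-terminated.
        return Encoding.Unicode.GetString(_raw).TrimEnd('\0');
    }
}
```

Walking a large tree then costs only the bookkeeping per node, and the decoding work is paid just for the values a caller actually looks at.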
Since the registry in Windows can get quite big, it may not be practical to read it completely into memory at once. You will need to chunk it. One solution that Windows applies is that the whole file is allocated in paged pool memory, which can span several gigabytes, but only the parts that are actually accessed are paged in from disk. That allows Windows to deal with a very large registry file efficiently. You will need something similar for your reader as well. Lazy parsing is one aspect; the ability to jump around in the file without reading the data in between is also crucial to stay performant.
More information about paged pool and the registry can be found here:
http://blogs.technet.com/b/markrussinovich/archive/2009/03/26/3211216.aspx
Your API design will depend on how you read the data to stay efficient (e.g. use a memory-mapped file and read from different mapped regions). .NET 4 introduced a memory-mapped file implementation that is quite good, but wrappers around the OS APIs exist as well.
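Reading from different mapped regions might look like the following sketch, which uses MemoryMappedFile from System.IO.MemoryMappedFiles. The MappedFileReader wrapper and its method names are hypothetical, and the offsets passed to it are whatever your own parser computes.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical wrapper that reads small regions of a file on demand.
public sealed class MappedFileReader : IDisposable
{
    private readonly MemoryMappedFile _file;

    public MappedFileReader(string path)
    {
        // Capacity 0 maps the file at its current size; access is read-only.
        _file = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    }

    // Reads a 32-bit integer at the given offset in the machine's native
    // byte order (little-endian on x86/x64) without touching earlier data.
    public int ReadInt32At(long offset)
    {
        using (var view = _file.CreateViewAccessor(offset, sizeof(int), MemoryMappedFileAccess.Read))
        {
            return view.ReadInt32(0);
        }
    }

    // Copies 'length' bytes starting at 'offset' into a new array.
    public byte[] ReadBytes(long offset, int length)
    {
        if (length <= 0) throw new ArgumentOutOfRangeException(nameof(length));
        using (var view = _file.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read))
        {
            var buffer = new byte[length];
            view.ReadArray(0, buffer, 0, length);
            return buffer;
        }
    }

    public void Dispose() => _file.Dispose();
}
```

Creating a view per read keeps the sketch short; a real reader would more likely keep one view per block or bin and reuse it.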
Yours,
Alois Kraus
To support delayed loading from a memory-mapped file, it would make sense not to read the byte array into the object and parse it later, but to go one step further and store only the offset and length of the chunk within the memory-mapped file. Later, when the object is actually accessed, you can read and deserialize the data. This way you can traverse the whole file and build a tree of objects that contain only the offsets and a reference to the memory-mapped file. That should save a large amount of memory.
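A minimal sketch of such an offset-only node could look like this. LazyNode is a hypothetical type, and treating the payload as an ASCII name is an assumption made purely to keep the example small.

```csharp
using System;
using System.Collections.Generic;
using System.IO.MemoryMappedFiles;
using System.Text;

// Hypothetical tree node: remembers where its data lives, not the data itself.
public sealed class LazyNode
{
    private readonly MemoryMappedFile _file;
    private string _name; // decoded on first access, then cached

    public LazyNode(MemoryMappedFile file, long offset, int length)
    {
        _file = file ?? throw new ArgumentNullException(nameof(file));
        if (length <= 0) throw new ArgumentOutOfRangeException(nameof(length));
        Offset = offset;
        Length = length;
    }

    public long Offset { get; }
    public int Length { get; }
    public List<LazyNode> Children { get; } = new List<LazyNode>();

    public string Name
    {
        get
        {
            if (_name == null)
            {
                // Map only this node's bytes and decode them now.
                using (var view = _file.CreateViewAccessor(Offset, Length, MemoryMappedFileAccess.Read))
                {
                    var raw = new byte[Length];
                    view.ReadArray(0, raw, 0, Length);
                    // Assumption: the payload is an ASCII name.
                    _name = Encoding.ASCII.GetString(raw);
                }
            }
            return _name;
        }
    }
}
```

Until Name is read, each node costs a reference, an offset, a length, and its child list, so the whole tree can be built in one pass over the file's structure without copying the payloads.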