Efficient disk access to a large number of small .mat files containing objects
I'm trying to determine the best way to store large numbers of small .mat files, around 9000 objects with sizes ranging from 2k to 100k, for a total of around half a gig.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time afterwards), as Windows 7 has difficulty handling so many files in one folder (and I think my SSD is having a rough time of it, too). However, the end result is fine: I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders, but that seems like a nasty hack (and it won't fix the SSD's dislike of writing many small files). Is there a better way?
Edit: The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
Comments (3)
Five ideas to consider:
save
to save objects in two places).
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Try storing them as blobs in a database.
I would also try the multiple folders method - it might perform better than you think. It might also help with organizing the files, if that's something you need.
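To make the blob idea concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. It is only an illustration of the pattern, not something the answer prescribes: the table name `objects`, the helper names `open_store`, `put_many` and `get_many`, and the choice of SQLite are all assumptions, and each object is assumed to have already been serialized to bytes by some other means.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a single-file store; ':memory:' is for demonstration only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "id INTEGER PRIMARY KEY, payload BLOB NOT NULL)"
    )
    return conn


def put_many(conn, items):
    """Write (id, bytes) pairs in one transaction, so disk writes are batched."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO objects (id, payload) VALUES (?, ?)", items
        )


def get_many(conn, ids):
    """Fetch only the requested blobs, keyed by id."""
    ids = list(ids)
    if not ids:
        return {}
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, payload FROM objects WHERE id IN ({marks})", ids
    )
    return {obj_id: bytes(payload) for obj_id, payload in rows}
```

With a layout like this, all objects live in one file on disk, and pulling ten of them is a single indexed query rather than ten file opens.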
The solution I have come up with is to save object arrays of around 100 objects each. These files tend to be 5-6 MB, so loading is not prohibitive, and access is just a matter of loading the right array(s) and then subsetting them to the desired entries. This compromise avoids writing too many small files, still allows fast access to single objects, and avoids any extra database or serialization overhead.
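The bookkeeping behind this chunking scheme can be sketched as follows. This is a Python illustration of the index arithmetic only, under several assumptions that are not in the post: objects are numbered from zero, each chunk holds exactly 100 objects, and the files follow a hypothetical `chunk_NNN.mat` naming pattern. The actual saving and loading would still be done in MATLAB.

```python
from collections import defaultdict

CHUNK_SIZE = 100  # roughly 100 objects per file, as described above


def locate(index, chunk_size=CHUNK_SIZE):
    """Map a zero-based object index to (chunk number, position within chunk)."""
    return divmod(index, chunk_size)


def chunk_filename(chunk, pattern="chunk_{:03d}.mat"):
    """Hypothetical naming scheme for the chunk files."""
    return pattern.format(chunk)


def plan_loads(indices, chunk_size=CHUNK_SIZE):
    """Group requested indices by chunk file so each file is read only once."""
    plan = defaultdict(list)
    for index in indices:
        chunk, pos = locate(index, chunk_size)
        plan[chunk_filename(chunk)].append(pos)
    return dict(plan)
```

Grouping requests by file matters when several of the wanted objects happen to share a chunk: the 5-6 MB array is then loaded once and subset several times.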