Efficient disk access to large numbers of small .mat files containing objects

Posted 2024-11-29 20:58:18


I'm trying to determine the best way to store large numbers of small .mat files, around 9000 objects with sizes ranging from 2k to 100k, for a total of around half a gig.

The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.

What I've tried:

Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after) as Windows 7 has difficulty handling so many files in a folder (and I think my SSD is having a rough time of it, too). However, the end result is fine: I can load what I need very quickly. This is using '-v6' save.

Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.

I know I could split the files up into many folders, but it seems like such a nasty hack (and it won't fix the SSD's dislike of writing many small files). Is there a better way?

Edit:
The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
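
For reference, the two approaches described above can be sketched roughly like this in MATLAB (the filenames, variable names, and the `objs` array are made up for illustration):

```matlab
% Method 1: one file per object (fast selective loading, but writing
% ~9000 small files strains the filesystem and the SSD)
for k = 1:numel(objs)
    obj = objs(k);                                  %#ok<NASGU>
    save(sprintf('obj_%05d.mat', k), 'obj', '-v6');
end
loaded = load('obj_00042.mat');                     % pulls just one object

% Method 2: everything in one file, then load only the variables needed
s = struct();
for k = 1:numel(objs)
    s.(sprintf('obj_%05d', k)) = objs(k);
end
save('all_objects.mat', '-struct', 's', '-v6');
picked = load('all_objects.mat', 'obj_00042');      % still slow: most of the
                                                    % file is scanned anyway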

3 Answers

李不 2024-12-06 20:58:18


Five ideas to consider:

  1. Try storing in an HDF5 object - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R).
  2. A variation on your method #2 is to store them in one or more files, but to turn off compression.
  3. Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been a while and I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
  4. @SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square 1). For this purpose consider SQLite, which doesn't require a separate server/client framework. There is an interface available on Matlab Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
  5. (New) Considering that the objects are less than 1GB, it may be easier to just copy the entire set to a RAM disk and then access through that. Just remember to copy from the RAM disk if anything is saved (or wrap save to save objects in two places).
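
A minimal sketch of idea 1, assuming each object's numeric matrix and uint32 vector are written as separate datasets under a per-object HDF5 group (the file name, group scheme, and field names are invented for illustration):

```matlab
% Write each object's numeric parts into its own group in one HDF5 file
fname = 'objects.h5';
for k = 1:numel(objs)
    grp = sprintf('/obj%05d', k);
    h5create(fname, [grp '/data'], size(objs(k).data));
    h5write(fname, [grp '/data'], objs(k).data);
    h5create(fname, [grp '/ids'], size(objs(k).ids), 'Datatype', 'uint32');
    h5write(fname, [grp '/ids'], objs(k).ids);
end

% Later: read back only the objects you actually need
data = h5read(fname, '/obj00042/data');
ids  = h5read(fname, '/obj00042/ids');
```

Since HDF5 reads individual datasets without scanning the whole file, this avoids the load-time problem of method #2 while keeping everything in one file on disk.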

Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:

  1. A serialization program from Matlab Central: http://www.mathworks.com/matlabcentral/fileexchange/29457 - which was inspired by: http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
  2. Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/

獨角戲 2024-12-06 20:58:18

Try storing them as blobs in a database.

I would also try the multiple-folders method - it might perform better than you think, and it might help with organizing the files if that's something you need.
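
If you do try the folder approach, a simple hash-style split keeps any one directory small. A sketch, assuming the same hypothetical `objs` array (the 100-subfolder scheme is just one choice):

```matlab
% Spread ~9000 files over 100 subfolders (~90 files each) by key modulo 100
for k = 1:numel(objs)
    sub = sprintf('%02d', mod(k, 100));
    if ~exist(sub, 'dir'); mkdir(sub); end
    obj = objs(k);                                  %#ok<NASGU>
    save(fullfile(sub, sprintf('obj_%05d.mat', k)), 'obj', '-v6');
end
```

Since the subfolder is computable from the object's index, loading a given object later doesn't require searching: rebuild the same `fullfile(sub, ...)` path and `load` it directly.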

云柯 2024-12-06 20:58:18

The solution I have come up with is to save object arrays of around 100 of the objects each. These files tend to be 5-6 meg so loading is not prohibitive and access is just a matter of loading the right array(s) and then subsetting them to the desired entry(ies). This compromise avoids writing too many small files, still allows for fast access of single objects and avoids any extra database or serialization overhead.
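
A sketch of this chunking scheme, assuming objects are indexed 1..N so the right chunk file is computable from the index (the chunk size, filenames, and `objs` array are illustrative):

```matlab
chunkSize = 100;

% Save: one object array of up to 100 objects per chunk file
for c = 1:ceil(numel(objs) / chunkSize)
    idx = (c-1)*chunkSize + 1 : min(c*chunkSize, numel(objs));
    chunk = objs(idx);                              %#ok<NASGU>
    save(sprintf('chunk_%03d.mat', c), 'chunk', '-v6');
end

% Load object i: compute its chunk, load that file, then subset
i = 1234;
c = ceil(i / chunkSize);
s = load(sprintf('chunk_%03d.mat', c), 'chunk');
obj = s.chunk(i - (c-1)*chunkSize);
```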
