Database or other method of storing and dynamically accessing huge binary objects
I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that they can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (in samples) and can be thought of as a row. One of these files normally has about 100,000 recordings, each 2,000,000 samples long.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
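For concreteness, the access pattern being asked about can be sketched directly against the flat file: with a row-major layout of 1-byte samples, samples 1,000-2,000 of every recording form a column slice. A minimal sketch using numpy's memmap; the filename is hypothetical and the dimensions come from the question.

```python
import numpy as np

N_RECORDINGS = 100_000   # rows (recordings), from the question
N_SAMPLES = 2_000_000    # columns (samples), 1 byte each

def load_columns(path, first, last):
    """Copy samples [first, last) of every recording into memory,
    reading only the bytes that back that window."""
    data = np.memmap(path, dtype=np.uint8, mode="r",
                     shape=(N_RECORDINGS, N_SAMPLES))
    # Slicing the memmap is lazy; np.array() copies just the window
    # (~100 MB for a 1,000-sample slice across 100,000 recordings).
    return np.array(data[:, first:last])

# e.g. window = load_columns("dataset.dat", 1_000, 2_000)
```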
4 Answers
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once added, they will remain the same.
My recommendation would be to store the files on disk, but create a DB-based index. Most filesystems get cranky or slow when you have more than 10k files in a folder/directory. Your application can generate the filenames and store the metadata in the DB, then organize the files on disk by their generated names. The downside is that the file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins or a sophisticated partitioned, incremental backup scheme. Also, seeks within a file (skip ahead, rewind, etc.) become much simpler operations; filesystems generally support these better than databases do.
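A minimal sketch of this scheme, assuming a hypothetical SQLite index table, generated per-set filenames, and the row-major 1-byte-sample layout from the question; all names and paths are illustrative.

```python
import sqlite3

N_SAMPLES = 2_000_000  # samples per recording, from the question

# Hypothetical index: per-set metadata plus the generated on-disk name.
db = sqlite3.connect("index.db")
db.execute("""CREATE TABLE IF NOT EXISTS datasets (
                  id              INTEGER PRIMARY KEY,
                  path            TEXT,   -- generated filename on disk
                  collection_date TEXT,
                  sample_rate     REAL,
                  owner           TEXT)""")

def read_window(dataset_id, recording, first, last):
    """Look up the file in the DB index, then seek straight to
    samples [first, last) of one recording."""
    (path,) = db.execute("SELECT path FROM datasets WHERE id = ?",
                         (dataset_id,)).fetchone()
    with open(path, "rb") as f:
        f.seek(recording * N_SAMPLES + first)  # row-major offset math
        return f.read(last - first)
```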
I wonder what makes you think that an RDBMS would be limited to mere thousands of rows; there's no reason that would be the case.
Also, at least some databases (Oracle, for example) allow direct access to parts of LOB data without loading the full LOB, if you know the offset and length you want. So you could have a table with some searchable metadata and then the LOB column, plus, if needed, an additional metadata table describing the LOB contents, so that you'd have some kind of keyword -> (offset, length) relation available for partial loading of LOBs.
Somewhat echoing another post here: incremental backups (which you might want to have here) are not quite feasible with databases (OK, they can be possible, but at least in my experience they tend to come with a nasty price tag).
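A minimal sketch of such a partial LOB read with the python-oracledb driver, assuming a hypothetical recordings table with a samples BLOB column; LOB.read() takes a 1-based offset and a byte count, so only the requested window is transferred.

```python
import oracledb  # python-oracledb driver; the schema is hypothetical

def read_window(conn, recording_id, first, n):
    """Fetch n bytes starting at 0-based offset `first` from one
    recording's BLOB without materializing the whole ~2 MB LOB."""
    cur = conn.cursor()
    cur.execute("SELECT samples FROM recordings WHERE id = :id",
                {"id": recording_id})
    (lob,) = cur.fetchone()        # a LOB locator, not the data itself
    return lob.read(first + 1, n)  # LOB offsets are 1-based
```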
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (It can be read either way.)
If it is 2 million samples making up 200 GB, then each sample is ~100 KB, and each recording is ~2 MB (100,000 recordings per file works out to 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading only a certain range into memory: if you have indexed the sample IDs, you could very quickly query for just the subset you want, loading only that range into memory from the DB query result.
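A minimal sketch of that indexed range query, using SQLite and a hypothetical schema with one row per (recording, sample chunk) and an index on the sample ID; all names are illustrative.

```python
import sqlite3

db = sqlite3.connect("recordings.db")
db.execute("""CREATE TABLE IF NOT EXISTS samples (
                  recording_id INTEGER,
                  sample_id    INTEGER,  -- first sample in this chunk
                  payload      BLOB)""")
db.execute("CREATE INDEX IF NOT EXISTS idx_sample_id "
           "ON samples (sample_id)")

def query_range(first, last):
    """Fetch only the chunks whose sample IDs fall in [first, last];
    the index keeps this from scanning the whole table."""
    return db.execute(
        "SELECT recording_id, sample_id, payload FROM samples "
        "WHERE sample_id BETWEEN ? AND ?", (first, last)).fetchall()
```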
I think that Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.
Have a read on TechNet for more depth: http://technet.microsoft.com/en-us/library/bb933993.aspx
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine, and limited in size only by your NTFS file system.
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)
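A minimal sketch of pulling a sample window from such a varbinary(MAX) column through pyodbc; the connection string, table, and column names are hypothetical, and T-SQL's SUBSTRING (1-based) does the slicing server-side.

```python
import pyodbc  # assumes a SQL Server ODBC driver is installed

def read_window(conn, recording_id, first, n):
    """Fetch n bytes starting at 0-based offset `first` from a
    varbinary(MAX) column; SUBSTRING slices on the server, so only
    the requested window crosses the wire."""
    row = conn.cursor().execute(
        "SELECT SUBSTRING(samples, ?, ?) FROM recordings WHERE id = ?",
        first + 1, n, recording_id).fetchone()
    return row[0]

# e.g. conn = pyodbc.connect("DSN=mydb;UID=...;PWD=...")
#      window = read_window(conn, 42, 1_000, 1_000)
```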