NoSQL for filesystem storage organization and replication?
We've been discussing the design of a data warehouse strategy within our group to meet testing, reproducibility, and data-syncing requirements. One suggestion is to adapt a NoSQL approach using an existing tool rather than re-implementing much of the same functionality on top of a file system. I don't know whether a NoSQL approach is even the best fit for what we're trying to accomplish, but perhaps if I describe what we need and want, you can help.
- Most of our files are large, 50+ GB in size, and held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value style lookup.
- When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
- We want to easily add, remove, and export files from storage.
- We'd like to set up automatic file replication between two servers (we can write a script for this). That is, sync the contents of one server with another. We don't need a distributed system that merely makes several machines appear as one server. We'd like complete replication.
- We also have other, smaller files that have a tree-type relationship with the big files. One file's content will point to the next, and so on. It's not a "spoked wheel," it's a full-blown tree.
We'd prefer a Python, C, or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
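To make the requirements above concrete, here is a minimal sketch of one possible reading of them: the large files stay on an ordinary file system, and only a small metadata index is kept in a database. It uses SQLite from the Python standard library purely for illustration. The table layout, the column names, the `parent_id` link used to model the tree, and all function names are assumptions made for this example, not part of any existing tool.

```python
import sqlite3

# Hypothetical schema: one row per file, keyed by the five-part combination
# from the question. Only the path is stored, never the file contents.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    date      TEXT NOT NULL,
    source    TEXT NOT NULL,
    time      TEXT NOT NULL,
    artifact  TEXT NOT NULL,
    path      TEXT NOT NULL,
    parent_id INTEGER REFERENCES files(id),  -- assumed way to model the tree
    UNIQUE (name, date, source, time, artifact)
);
"""


def open_index(db_path=":memory:"):
    """Open (or create) the metadata index."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def add_file(conn, key, path, parent_id=None):
    """Register a file under its (name, date, source, time, artifact) key."""
    cur = conn.execute(
        "INSERT INTO files (name, date, source, time, artifact, path, parent_id) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (*key, path, parent_id),
    )
    conn.commit()
    return cur.lastrowid


def find_path(conn, key):
    """Return the stored path for a key, or None. The file itself is not read."""
    row = conn.execute(
        "SELECT path FROM files "
        "WHERE name=? AND date=? AND source=? AND time=? AND artifact=?",
        key,
    ).fetchone()
    return row[0] if row else None


def descendants(conn, root_id):
    """Return the paths of every file below root_id, using a recursive query."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id, path) AS (
            SELECT id, path FROM files WHERE parent_id = ?
            UNION ALL
            SELECT f.id, f.path FROM files f JOIN sub ON f.parent_id = sub.id
        )
        SELECT path FROM sub ORDER BY id
        """,
        (root_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

A lookup returns only a path, which could then be handed to the third-party API to read portions of the file. Replication would remain a separate concern (for example, a sync script that copies both the files and the index).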
Comments (3)
Have you had a look at MongoDB's GridFS?
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata, plus your own additional metadata. Files are broken into small chunks, and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table), and you get Mongo's replication features to boot.
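As a rough illustration of what "specify which portions you want" means for a chunked store, the sketch below works out which chunk numbers cover a given byte range. It does not talk to MongoDB at all. The function name is made up for this example, and the 255 KiB default is an assumption based on recent drivers (older versions used a different default), so the chunk size is left as a parameter.

```python
# Assumed default chunk size; check the value your driver actually uses.
DEFAULT_CHUNK_SIZE = 255 * 1024


def chunks_for_range(offset, length, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (first_chunk, last_chunk, skip) for a byte range.

    first_chunk and last_chunk are zero-based chunk numbers, and skip is the
    number of bytes to discard at the start of the first chunk.
    """
    if offset < 0 or length <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0; length and chunk_size must be > 0")
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    skip = offset % chunk_size
    return first_chunk, last_chunk, skip


if __name__ == "__main__":
    # Reading 100 bytes starting at byte 250 with 100-byte chunks
    # touches chunks 2 and 3, skipping 50 bytes of chunk 2.
    print(chunks_for_range(250, 100, chunk_size=100))
```

Drivers such as PyMongo expose file-like reads on top of this, so in practice the arithmetic is usually done for you. The point is only that a partial read needs a handful of chunks rather than the whole 50 GB file.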
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience, though, Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as a file system backend.
Performance could obviously be one. What about space usage? Consistency?