Simple versioning system, versioned file system, or versioned database
I am looking for a simple versioning system for a large number of records or files (~50 million, ~100GB unpacked, ~20MB packed). The files are only a few kilobytes each and have unique IDs, so I don't mind whether they are stored in a flat structure (table, directory...) or not. On average, each record is changed once a month, but most changes have diffs of less than a kilobyte, so it should be easy to compress versions. However, a naive database with one full entry for each version would grow too quickly. I need the following operations:
- basic CRUD operations: create, read, update, delete
- quick listing of recent changes
- quick listing of recent changes of a particular record
- query for changes in a given period of time
- query for changes by a given user (each edit is associated with a user ID and optionally has a commit message as a comment)
- for write operations, there must be a commit hook to validate and reject ill-formed records
In short, I am looking for wiki-like software for simple records or files.
I thought about possible solutions:
- Put the files in a version control system. This gives me replication and many available access tools, so it is my preferred solution. But the amount of data is too large for distributed systems like git. Is anyone using Subversion successfully for a similar task?
- Implement my own versioning in a database or in a file system. I would probably need to store only compressed records and diffs; it would be more work, but I would learn something. This would be my preferred solution if it were just for fun.
- Use a versioning file system. This would make setup, replication and access more difficult. I would probably need to implement my own access API on top of the file system.
- Use a versioning database system. Can you suggest some?
- Use some other existing data store with versioning (MediaWiki? Amazon Cloud Drive? ...)
Obviously there are many paths. Which paths have others used successfully for similar or larger amounts of data?
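For illustration only, the option of implementing versioning in a database could look roughly like the minimal sketch below. It assumes SQLite and zlib from the Python standard library; the table layout and the names `VersionedStore`, `put`, `get` and `changes` are hypothetical. The latest version of each record is stored compressed in full, and each change row keeps the previous version compressed with the new version as a zlib preset dictionary, which behaves like a cheap reverse diff for records of a few kilobytes. Deletion is left out to keep the sketch short.

```python
import sqlite3
import time
import zlib

SCHEMA = """
CREATE TABLE IF NOT EXISTS current (
    record_id TEXT PRIMARY KEY,
    version   INTEGER NOT NULL,
    data      BLOB NOT NULL           -- latest content, zlib-compressed in full
);
CREATE TABLE IF NOT EXISTS changes (
    record_id  TEXT NOT NULL,
    version    INTEGER NOT NULL,
    user_id    TEXT NOT NULL,
    message    TEXT,
    changed_at REAL NOT NULL,
    prev_delta BLOB,                  -- previous content, compressed against the new content
    PRIMARY KEY (record_id, version)
);
CREATE INDEX IF NOT EXISTS changes_by_time ON changes (changed_at);
CREATE INDEX IF NOT EXISTS changes_by_user ON changes (user_id, changed_at);
"""


def _delta(old, new):
    """Compress `old` with `new` as preset dictionary, so shared bytes cost very little."""
    comp = zlib.compressobj(zdict=new)
    return comp.compress(old) + comp.flush()


def _undelta(delta, new):
    """Recover the older content from its delta and the newer content."""
    decomp = zlib.decompressobj(zdict=new)
    return decomp.decompress(delta) + decomp.flush()


class VersionedStore:
    def __init__(self, path=":memory:", validate=None):
        self.db = sqlite3.connect(path)
        self.db.executescript(SCHEMA)
        # The "commit hook": a callable that raises to reject a record.
        self.validate = validate or (lambda data: None)

    def put(self, record_id, data, user_id, message=None, now=None):
        """Create or update a record; returns the new version number."""
        if not data:
            raise ValueError("empty record")
        self.validate(data)
        row = self.db.execute(
            "SELECT version, data FROM current WHERE record_id = ?", (record_id,)
        ).fetchone()
        version = 1 if row is None else row[0] + 1
        prev_delta = None if row is None else _delta(zlib.decompress(row[1]), data)
        stamp = time.time() if now is None else now
        with self.db:  # one transaction for both tables
            self.db.execute(
                "INSERT OR REPLACE INTO current VALUES (?, ?, ?)",
                (record_id, version, zlib.compress(data)),
            )
            self.db.execute(
                "INSERT INTO changes VALUES (?, ?, ?, ?, ?, ?)",
                (record_id, version, user_id, message, stamp, prev_delta),
            )
        return version

    def get(self, record_id, version=None):
        """Read the latest content, or walk the reverse deltas back to `version`."""
        row = self.db.execute(
            "SELECT version, data FROM current WHERE record_id = ?", (record_id,)
        ).fetchone()
        if row is None or (version is not None and not 1 <= version <= row[0]):
            raise KeyError((record_id, version))
        data = zlib.decompress(row[1])
        if version is not None:
            deltas = self.db.execute(
                "SELECT prev_delta FROM changes WHERE record_id = ? AND version > ? "
                "ORDER BY version DESC",
                (record_id, version),
            )
            for (delta,) in deltas:
                data = _undelta(delta, data)
        return data

    def changes(self, record_id=None, user_id=None, since=None, until=None, limit=50):
        """Recent changes, optionally filtered by record, user and time window."""
        filters = (("record_id = ?", record_id), ("user_id = ?", user_id),
                   ("changed_at >= ?", since), ("changed_at < ?", until))
        clauses = [sql for sql, value in filters if value is not None]
        params = [value for _, value in filters if value is not None]
        where = "WHERE " + " AND ".join(clauses) if clauses else ""
        return self.db.execute(
            "SELECT record_id, version, user_id, message, changed_at FROM changes "
            f"{where} ORDER BY changed_at DESC, version DESC LIMIT ?",
            (*params, limit),
        ).fetchall()
```

Reading the latest version costs a single decompression, while reading an old version replays one delta per intermediate change. That trade-off seems reasonable if most reads target the current state, but it has not been measured at the scale described above.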
Comments (1)
If you're not averse to having a raw copy of each file on your client (which I imagine is OK if you're considering svn), then git is probably quite a good solution to your problem. The underlying repository storage will use binary diffs between files as well as between versions, so you should get close to optimal compression there.
With a bare repo and some scripting, you may even be able to get away with not having the current revision checked out: objects are available from the command line and you can create new commits without a checkout.
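As a rough sketch of that scripting, the Python below drives git's plumbing commands (`hash-object`, `read-tree`, `update-index --cacheinfo`, `write-tree`, `commit-tree`, `update-ref`) against a bare repository, so no working copy is ever checked out. The helper names, the branch `refs/heads/main`, the scratch index file and the placeholder e-mail domain are all assumptions made for this example. Note that `read-tree` loads the whole tree into the index on every commit, which may well be too slow for tens of millions of files without further work such as sharding records into subdirectories; I have not measured it.

```python
import os
import subprocess

BRANCH = "refs/heads/main"  # assumed branch name


def git(repo, *args, stdin=None, env=None, check=True):
    """Run git against the bare repository at `repo`; return (exit code, stdout)."""
    proc = subprocess.run(
        ["git", "--git-dir", repo, *args],
        input=stdin, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        env={**os.environ, **(env or {})}, check=check,
    )
    return proc.returncode, proc.stdout


def commit_record(repo, path, data, user, message, validate=None):
    """Write `data` at `path` as a new commit on BRANCH, without any checkout."""
    if validate is not None:
        validate(data)  # raise here to reject an ill-formed record
    env = {
        # A private index file, so nothing depends on a working tree.
        "GIT_INDEX_FILE": os.path.join(os.path.abspath(repo), "scratch.index"),
        "GIT_AUTHOR_NAME": user, "GIT_AUTHOR_EMAIL": f"{user}@example.invalid",
        "GIT_COMMITTER_NAME": user, "GIT_COMMITTER_EMAIL": f"{user}@example.invalid",
    }
    # 1. Store the content as a blob object.
    _, out = git(repo, "hash-object", "-w", "--stdin", stdin=data)
    blob = out.decode().strip()
    # 2. Load the current tree (or start empty on the very first commit).
    code, out = git(repo, "rev-parse", "--verify", "-q", BRANCH, check=False)
    parent = out.decode().strip() if code == 0 else None
    git(repo, "read-tree", parent or "--empty", env=env)
    # 3. Point `path` at the new blob and write the resulting tree.
    git(repo, "update-index", "--add", "--cacheinfo", f"100644,{blob},{path}", env=env)
    _, out = git(repo, "write-tree", env=env)
    tree = out.decode().strip()
    # 4. Create the commit object and advance the branch.
    args = ["commit-tree", tree, "-m", message] + (["-p", parent] if parent else [])
    _, out = git(repo, *args, env=env)
    commit = out.decode().strip()
    git(repo, "update-ref", BRANCH, commit)
    return commit


def read_record(repo, path, rev=BRANCH):
    """Return the content of `path` at revision `rev`."""
    _, out = git(repo, "cat-file", "blob", f"{rev}:{path}")
    return out


def record_history(repo, path, limit=20):
    """Return [commit, author, subject] for the most recent changes to `path`."""
    _, out = git(repo, "log", f"-n{limit}", "--format=%H%x09%an%x09%s", BRANCH, "--", path)
    return [line.split("\t", 2) for line in out.decode().splitlines()]
```

The repository itself would be created once with `git init --bare`. The same `git log` call also accepts `--author` and `--since`/`--until` for the per-user and per-period queries, and if clients push to the repository instead of calling a script like this, a server-side `pre-receive` hook is the usual place to reject bad records.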