Best practices for version control of text data
What are the best practices for versioning data contained in several large (100MB+) CSV files?
Is SVN a good option?
Update:
After deliberating on this for a while, I feel it may be a better option to gzip/zip each CSV file and then add it to the repo. That way, I'd save myself the headache of version management without giving up disk space. It's at least as good as, if not better than, managing the versions manually.
Still looking out for the perfect solution.
Also, a small note:
Versioning of the file contents is not a requirement. I don't need to know which words have changed within a file, as long as I can record a summary of changes or add a note to each version.
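A minimal Python sketch of the gzip-and-note idea from the update above. The function name `archive_csv`, the `notes.tsv` log file, and the hash-based file naming are assumptions made for illustration, not something the question specifies.

```python
import gzip
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_csv(csv_path, archive_dir, note):
    """Gzip a CSV into archive_dir and log a one-line change note for it."""
    csv_path = Path(csv_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Hash the uncompressed bytes so each version gets a stable identifier.
    digest = hashlib.sha256()
    with csv_path.open("rb") as src:
        for chunk in iter(lambda: src.read(1 << 20), b""):
            digest.update(chunk)
    short_id = digest.hexdigest()[:12]

    # Stream through gzip so a 100MB+ file never has to fit in memory.
    target = archive_dir / f"{csv_path.stem}.{short_id}.csv.gz"
    with csv_path.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Append the note to a tab-separated log (hypothetical file name).
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    clean_note = note.replace("\t", " ").replace("\n", " ")
    with (archive_dir / "notes.tsv").open("a", encoding="utf-8") as log:
        log.write(f"{stamp}\t{target.name}\t{clean_note}\n")
    return target
```

Calling `archive_csv("data.csv", "archive", "fixed March totals")` would leave one compressed file per distinct content hash plus a running log of notes, which covers the "summary of changes per version" requirement without any line-level diffing.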
2 Answers
This largely depends on how you intend to use these files.
SVN, and most other source control systems, would give you revision numbers that uniquely identify a specific version of the file. Every time you commit a new CSV, that commit gets its own revision number.
However...
Depending on usage, it might not be a good solution. Let's say you check in a CSV and it ends up at SVN revision 1234. Somebody then checks that file out, maybe sends it to somebody else, and so on. The holder of the CSV cannot tell from the CSV itself which revision it is, and therefore cannot tell whether they are using the latest version.
Personally, I would put a version number in the filename or add a row containing a version number to the start/end of the CSV - however, these options also depend on your usage.
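A rough Python sketch of the "version row at the start of the CSV" idea. The `# version:` marker and both function names are hypothetical; nothing in the CSV format defines such a line, so any other consumer of the file has to know to skip it.

```python
import csv
import io

VERSION_PREFIX = "# version:"  # hypothetical marker, not a CSV standard


def write_versioned_csv(stream, version, header, rows):
    """Write a version line first, then a normal CSV header and rows."""
    stream.write(f"{VERSION_PREFIX} {version}\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_versioned_csv(stream):
    """Return (version, rows); version is None if the marker is missing."""
    first = stream.readline()
    version = None
    if first.startswith(VERSION_PREFIX):
        version = first[len(VERSION_PREFIX):].strip()
    else:
        stream.seek(0)  # no marker: rewind (assumes a seekable stream)
    return version, list(csv.reader(stream))


# Round trip through an in-memory buffer.
buf = io.StringIO()
write_versioned_csv(buf, "1234", ["id", "name"], [[1, "a"], [2, "b"]])
buf.seek(0)
version, rows = read_versioned_csv(buf)
```

With this layout, whoever receives the file can check `version` against the repository without needing access to the repository's metadata.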
Food for thought...
EDIT: Additionally, there might be an issue with diffs. I am not certain whether SVN supports diffs on CSV files, so every time you check in, SVN might internally replace the older file completely (keeping the old one for reference). That could rapidly use a lot of disk space.
SVN is terribly slow because it transfers all the data over the network.
Try a local git or hg repository. It only needs file access, which should be much faster than going over the network. Both repository types also handle moved files, renames and merges much better. Additionally, git can use 'plugins' to support further file types, for example to merge office documents (odf, doc etc.).
Unlike SVN, they keep a single hidden repository directory that contains the compressed repository. SVN has a .svn directory in every subdirectory, holding the last state of each file (and other stuff).
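As a sketch of the local-repository workflow, the following Python helper drives the `git` command line to commit a CSV with a note as the commit message. The helper names `git` and `commit_csv` are made up for this example. It assumes `git` is installed, a user name and email are configured, and the file actually changed (otherwise `git commit` exits with an error).

```python
import subprocess
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def commit_csv(repo, csv_name, note):
    """Create the repo if needed, then commit csv_name with note as message."""
    repo = Path(repo)
    if not (repo / ".git").exists():
        git(repo, "init")
    git(repo, "add", "--", csv_name)
    git(repo, "commit", "-m", note)
    # Short hash of the new commit, usable as a version identifier.
    return git(repo, "rev-parse", "--short", "HEAD").strip()
```

The returned short hash plays the role that the revision number plays in SVN, and `git(repo, "log", "--oneline")` lists the notes recorded so far.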
Some random numbers:
Assume the size of all files (not repo info) in the repository is 100MB
This is what we've experienced with SVN and git. I'm using hg (mercurial) only occasionally.
Regarding MrEyes' answer, I'd also suggest adding some version info to the CSV file or to the file name. Git will detect the file rename, including the changes etc.