Using Git as the backend of an update server: how to keep the repository small?
Here's my use case. I have a desktop app that can download media content from my server on demand. Every week or so, new media will be pushed, renamed, modified, etc. on the server, and the clients will send me requests every day or so to check whether there are updates available that they should download.
To accurately and easily determine the new files the clients need, I was thinking of using Git on the server, and storing for each client the revision hash of the data it has downloaded. On every update request, I can then easily check with Git which files were added, deleted, renamed, etc. with something like git diff --name-status -C <clientRevision> HEAD, and then send only the needed updates.
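A minimal sketch of that lookup in Python, assuming git is installed on the server and the media live in a repository at a path called repo_dir here (the function names and the tuple layout are illustrative, not part of any existing tool). It passes the client's revision first and HEAD second so that the reported statuses describe the changes from the client's state to the current one.

```python
import subprocess


def parse_name_status(output):
    """Parse `git diff --name-status` output into (status, old_path, path) tuples.

    Renames and copies carry two paths; all other statuses carry one,
    in which case old_path is None.
    """
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status = fields[0][0]  # "R100" or "C75" -> "R" or "C"
        if status in ("R", "C"):
            changes.append((status, fields[1], fields[2]))
        else:
            changes.append((status, None, fields[1]))
    return changes


def changes_since(repo_dir, client_revision):
    """Ask git what changed between the client's revision and HEAD."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-status", "-C",
         client_revision, "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return parse_name_status(result.stdout)
```

Paths containing unusual characters are quoted by git in this output format, so a production version would probably want the -z option and a matching parser.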
My question is: obviously, I don't need to keep the whole binary history of my media on the server. I don't care what file X looked like two months ago; I just need to know whether it was changed in the meantime, or renamed from Y to X, for instance. Is it possible to use Git in such a way that I could get rid of the "binary history" of files while still keeping track of which files were modified, added, removed, and renamed? Or is there another obvious technological choice that I've overlooked for this kind of scenario?
(Yes, I'd love to use rsync for the whole thing; unfortunately the only thing I know about my clients is that they're running on the JVM, may use port 80, and can write to the directory that should contain the needed media files, so rsync is unfortunately not an option.)
Comments (3)
See my comment for the real answer, but comments don't allow for proper formatting.
Here's a quick sketch of a crazy idea if you want to go with git.
I understand that you do have control over the client devices and that you can run git on those devices. You could consider creating a mirror tree of hashes (e.g. md5/sha1 hashes) of the original binary files. Git then looks at this "hash tree" to determine what's new; just make sure the client gets the actual data before it updates its git checkout.
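One way such a mirror tree could be generated is sketched below in Python. The directory layout (one small text file per media file, at the same relative path, holding the SHA-1 of the original) and the function names are assumptions made for illustration. Only the mirror directory would be committed to git, so the repository never sees the binaries themselves.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files are not loaded into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_hash_tree(media_dir, mirror_dir):
    """Mirror media_dir into mirror_dir, replacing each file by its hash.

    Returns a dict mapping relative path -> hex digest.
    """
    media_dir, mirror_dir = Path(media_dir), Path(mirror_dir)
    mirror_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for source in sorted(media_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(media_dir)
        target = mirror_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        digest = sha1_of_file(source)
        target.write_text(digest + "\n")
        written[relative.as_posix()] = digest
    # Drop hash files whose media file is gone, leaving git's own data alone.
    for stale in sorted(mirror_dir.rglob("*")):
        relative = stale.relative_to(mirror_dir)
        if (stale.is_file() and ".git" not in relative.parts
                and relative.as_posix() not in written):
            stale.unlink()
    return written
```

Because a renamed media file produces a hash file with identical content at a new path, git's rename detection should still be able to report it as a rename.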
Git is great, but not the right tool for the job. If you are uninterested in history, and have large binaries, git is just going to cause problems.
Instead, I recommend a small SQL database for meta-information and an on-disk directory to store the media files.
First, the on-disk media files: to allow corruption detection and support renaming without retransferring large media files, name the files by their SHA (or MD5 or almost any decent checksum algorithm). You can either link the "real" filename or use a translation table (possibly from a DB, possibly not) to present the good name to the user.
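As a rough illustration of naming files by their checksum, the following Python sketch copies a file into a store directory under its SHA-256 digest and checks a stored file against its own name. SHA-256, the flat store layout, and the function names are choices made for this example only.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_media(source, store_dir):
    """Copy source into store_dir under its digest; return the digest."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    digest = file_digest(source)
    target = store_dir / digest
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(source, target)
    return digest


def is_intact(stored_path):
    """A stored file is intact if its content still hashes to its name."""
    stored_path = Path(stored_path)
    return file_digest(stored_path) == stored_path.name
```

With this layout a rename never touches the stored file; only the table that maps digests to user-visible names changes.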
Second, the SQL database. Track a revision (sequence) number for each client in a table. Track the revision id each media file was last updated at. Track the current name of each media file and the revision at which that name was last added, renamed, or deleted (filename NULL for delete).
Using this, you can instantly tell exactly which media files need to be sent to the user.
You can also instantly tell exactly which new file mappings need to be sent.
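The two lookups could look roughly like the following, shown with Python's built-in sqlite3 module. The table and column names (clients, media, mapping, updated_rev, changed_rev) are hypothetical and only one of many possible layouts for the tracking described above.

```python
import sqlite3

# Hypothetical schema: one revision per client, one row per media file,
# and one current name per media file (NULL filename means deleted).
SCHEMA = """
CREATE TABLE clients (
    client_id TEXT PRIMARY KEY,
    revision  INTEGER NOT NULL
);
CREATE TABLE media (
    media_id    INTEGER PRIMARY KEY,
    sha         TEXT NOT NULL,
    updated_rev INTEGER NOT NULL
);
CREATE TABLE mapping (
    media_id    INTEGER PRIMARY KEY REFERENCES media(media_id),
    filename    TEXT,
    changed_rev INTEGER NOT NULL
);
"""


def media_to_send(db, client_id):
    """Digests of media whose content is newer than the client's revision."""
    rows = db.execute(
        """
        SELECT m.sha
        FROM media AS m
        JOIN mapping AS n ON n.media_id = m.media_id
        WHERE n.filename IS NOT NULL
          AND m.updated_rev >
              (SELECT revision FROM clients WHERE client_id = ?)
        ORDER BY m.media_id
        """,
        (client_id,),
    ).fetchall()
    return [row[0] for row in rows]


def mappings_to_send(db, client_id):
    """(digest, filename) pairs whose name changed after the client's revision."""
    return db.execute(
        """
        SELECT m.sha, n.filename
        FROM mapping AS n
        JOIN media AS m ON m.media_id = n.media_id
        WHERE n.changed_rev >
              (SELECT revision FROM clients WHERE client_id = ?)
        ORDER BY n.media_id
        """,
        (client_id,),
    ).fetchall()
```

Once the client confirms a successful update, its row in clients would be set to the server's current revision.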
If you ever suspect corruption (or periodically), you can validate the media on the client and server by computing the SHA and comparing it to the filename, and then looking it up in the mapping table (both client and server) and media table (server). Also, just send the latest mapping file (or a partition of the mapping file, or a checksum of the mapping file) to validate what is going on there. Simple, easy to understand, and easy to develop.
Hello,
You have port 80 open from the client to a server you manage. I suppose you may use a client other than git on the client side.
Do not use git to pull data from the server. Try to use a plain HTTP client and the HTTP methods designed for this: HEAD to find out whether a file has changed and, if so, GET to fetch it. You could give your server repository some layout: download the index file for that particular client, and then check each file in that index. Take inspiration from Debian Apt repositories (the diffs, signing of files, etc.), if that would work for your use case. WebDAV is another option for accessing the server, offering even more comfort. You do not mention authentication, which may be required. If the client speaks HTTP, you may use a (caching) proxy.
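To make the index-file idea concrete, here is a small Python sketch that compares a client's local index with the one downloaded from the server and derives what to fetch, rename, and delete. The index format (a mapping from relative path to content hash) and the function name are assumptions for illustration; the download itself would be ordinary HTTP GET requests and is left out.

```python
def plan_update(local_index, remote_index):
    """Compare two {path: content_hash} indexes.

    Returns (downloads, renames, deletes):
      downloads: paths to GET from the server
      renames:   (old_path, new_path) pairs that can be done locally
      deletes:   local paths that no longer exist on the server
    """
    local_by_hash = {}
    for path, digest in local_index.items():
        local_by_hash.setdefault(digest, []).append(path)

    downloads, renames = [], []
    reused = set()
    for path, digest in sorted(remote_index.items()):
        if local_index.get(path) == digest:
            continue  # unchanged, nothing to do
        # Same content already on disk under a name the server dropped?
        candidates = [
            old for old in local_by_hash.get(digest, [])
            if old not in remote_index and old not in reused
        ]
        if candidates:
            renames.append((candidates[0], path))
            reused.add(candidates[0])
        else:
            downloads.append(path)

    deletes = sorted(
        path for path in local_index
        if path not in remote_index and path not in reused
    )
    return downloads, renames, deletes
```

Detecting renames by matching hashes avoids transferring a large file again just because its name changed.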
You may keep your data, the tree presented over the HTTP server, in a git repository. Replace the binaries with small files containing hashes (as suggested by Klaas van Schelven), and you can even add other metadata, a change log, timestamps, authors of files, etc.