Using Git as the backend for an update server: how to keep the repository small

Posted on 2024-11-08 07:43:36

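Here is my use case. I have a desktop application that feeds on media content from my server. About once a week, new media is pushed, renamed, modified, and so on, on the server, and about once a day each client sends me a request to check whether updates are available for download.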

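To determine accurately and easily which new files a client needs, I am thinking about using Git on the server and storing, for each client, the revision hash of the data it has already downloaded. On each update request, I could use something like git diff --name-status -C HEAD to check which files were added, deleted, renamed, and so on, and then send only the required updates.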
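
To make the idea above concrete, here is a minimal Python sketch of how the server could turn the output of git diff --name-status -C into a list of changes. The Change type, the parse_name_status helper, and the sample paths are hypothetical names for illustration. It relies on the tab-separated format Git prints: a status letter, optionally followed by a similarity score for renames and copies, then one or two paths. In practice the text would presumably come from comparing the client's stored revision against HEAD; the sketch only parses a hard-coded sample.

```python
from typing import List, NamedTuple, Optional


class Change(NamedTuple):
    status: str                    # "A", "M", "D", "R", "C", ...
    path: str                      # current path (the new path for renames/copies)
    old_path: Optional[str] = None


def parse_name_status(output: str) -> List[Change]:
    """Parse the text printed by `git diff --name-status -C`."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        code = fields[0][0]        # "R100" -> "R", "C75" -> "C"
        if code in ("R", "C"):
            # Renames and copies carry two paths: old first, then new.
            changes.append(Change(code, fields[2], fields[1]))
        else:
            changes.append(Change(code, fields[1]))
    return changes


# Hypothetical sample of what Git might print between two revisions.
SAMPLE = (
    "A\tmedia/new.mp4\n"
    "M\tmedia/intro.mp4\n"
    "D\tmedia/old.mp4\n"
    "R100\tmedia/y.mp4\tmedia/x.mp4\n"
)

if __name__ == "__main__":
    for change in parse_name_status(SAMPLE):
        print(change)
```

A rename with a 100% similarity score means the content is unchanged, so the client would only need to move its local file rather than download it again.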

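My question is this: obviously, I do not need to keep the entire binary history of the media on the server. I do not care what file X looked like two months ago; I only need to know whether it has changed in the meantime, or, for example, whether it was renamed from Y to X. Is it possible to use Git in a way that gets rid of the files' "binary history" while still tracking which files were modified, added, deleted, and renamed? Or is there another obvious technology choice for this situation that I have overlooked?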

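(Yes, I would love to use rsync for the whole thing; unfortunately, the only things I know about the clients are that they run on the JVM, can probably use port 80, and can write to the directory that should contain the required media files, so sadly rsync is not an option.)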

梦一生花开无言 2024-11-15 07:43:36

See my comment for the real answer, but comments don't allow for proper formatting.

Here's a quick sketch of a crazy idea if you want to go with git.
I understand that you do have control over the client devices and that you can run git on those devices. You could consider creating a mirror tree of hashes (e.g. md5/sha1 hashes) of the original binary files. Git then only tracks the "hash tree" to determine what is new, and the client makes sure to fetch the actual data before updating its Git repository. Like so:

/actual/somedir/imag1.jpg


/mirror/somedir/imag1.jpg  <= contains md5 hash
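
A rough Python sketch of how such a mirror tree could be generated, assuming /actual and /mirror are sibling directories as in the layout above. The function names file_digest and build_mirror are made up for illustration, and SHA-1 is just one of the hash choices mentioned. The sketch does not remove mirror entries whose originals were deleted; that would need an extra pass.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algo: str = "sha1") -> str:
    """Hash a file in chunks so large media files are never fully in memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_mirror(actual_root: Path, mirror_root: Path) -> int:
    """Write one small hash file per media file, keeping the same relative path.

    Returns the number of hash files written. Only mirror_root would be
    committed to Git; actual_root stays outside version control.
    """
    count = 0
    for src in sorted(actual_root.rglob("*")):
        if not src.is_file():
            continue
        dst = mirror_root / src.relative_to(actual_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(file_digest(src) + "\n")
        count += 1
    return count
```

Because each mirror file is only a few dozen bytes, the Git history stays small no matter how often the binaries change, and a renamed media file shows up as a rename of an identical hash file.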


锦上情书 2024-11-15 07:43:36

Git is great, but not the right tool for the job. If you are uninterested in history, and have large binaries, git is just going to cause problems.

Instead, what I recommend is a small SQL database for meta-information and an on-disk directory to store the media files.

First, the on-disk media files: to allow corruption detection and support renaming without retransferring large media files, name the files by their SHA (or MD5 or almost any decent checksum algorithm). You can either link the "real" filename or use a translation table (possibly from a DB, possibly not) to present the good name to the user.
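
A minimal sketch of that storage scheme in Python, assuming SHA-256 as the checksum and a plain dictionary as the translation table. The names store_by_hash and store_dir are invented for this example.

```python
import hashlib
import shutil
from pathlib import Path


def store_by_hash(src: Path, store_dir: Path) -> str:
    """Copy src into store_dir under its SHA-256 hex digest and return the digest."""
    h = hashlib.sha256()
    with src.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / digest
    if not target.exists():        # identical content is stored only once
        shutil.copyfile(src, target)
    return digest
```

With this layout a rename never touches the stored blob: only the entry in the translation table (friendly name to digest) changes, so nothing large has to be transferred again.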

Second, the SQL database. Track a revision (sequence) number for each client in a table. Track the revision ID at which each media file was last updated. Track the current name of each media file and the last time that name was added, renamed, or deleted (filename NULL for delete).

Using this, you can instantly tell exactly which media files need to be sent to the user:

select clientid,mediaid from tmedia join tclients on tmedia.revisionid > tclients.revisionid;

You can instantly tell exactly what new file mappings need to be sent:

select mediaid,filename,clientid from tmapping join tclients on tmapping.revisionid > tclients.revisionid;
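
To see the two queries in action, here is a small self-contained sketch using Python's built-in sqlite3 module. The table and column names come from the queries above; the column types, the sample rows, and the added ORDER BY clauses (only there to make the output order predictable) are assumptions.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a guessed schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table tclients (clientid integer primary key, revisionid integer);
        create table tmedia   (mediaid text primary key, revisionid integer);
        create table tmapping (mediaid text, filename text, revisionid integer);

        insert into tclients values (1, 5), (2, 7);
        insert into tmedia   values ('aaa', 3), ('bbb', 6), ('ccc', 8);
        -- A NULL filename marks a deletion, as described above.
        insert into tmapping values
            ('aaa', 'intro.mp4', 3), ('bbb', 'x.mp4', 7),
            ('ccc', 'new.mp4', 8),   ('ddd', NULL, 6);
    """)
    return conn


def pending_media(conn: sqlite3.Connection) -> list:
    """Media files each client has not received yet."""
    return conn.execute(
        "select clientid, mediaid from tmedia join tclients "
        "on tmedia.revisionid > tclients.revisionid "
        "order by clientid, mediaid"
    ).fetchall()


def pending_mappings(conn: sqlite3.Connection) -> list:
    """Name changes (adds, renames, deletes) each client has not seen yet."""
    return conn.execute(
        "select mediaid, filename, clientid from tmapping join tclients "
        "on tmapping.revisionid > tclients.revisionid "
        "order by clientid, mediaid"
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(pending_media(db))
    print(pending_mappings(db))
```

In a real service you would probably add a where clause on clientid so that each update request only computes the rows for the client that is asking.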

If you ever suspect corruption (or just periodically), you can validate the media on the client and server by computing the SHA and comparing it to the filename, and then looking it up in the mapping table (both client and server) and media table (server). Also, just send the latest mapping file (or a partition of the mapping file, or a checksum of the mapping file) to validate what is going on there. Simple, easy to understand, and easy to develop.
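
The corruption check could look roughly like this in Python, assuming files are stored in a flat directory and named by their SHA-256 hex digest. The name find_corrupt and the flat layout are assumptions of this sketch.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_corrupt(store_dir: Path) -> List[str]:
    """Return the names of stored files whose content no longer matches their name."""
    bad = []
    for entry in sorted(store_dir.iterdir()):
        if entry.is_file() and sha256_of(entry) != entry.name:
            bad.append(entry.name)
    return bad
```

Any name this returns can simply be downloaded again, since the filename itself says what the correct content must hash to.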


ヅ她的身影、若隐若现 2024-11-15 07:43:36

Hello,

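You have port 80 available from the clients to a server you administer. I suppose you can use something other than git as the client on those machines.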

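Don't use git to pull data from the server. Try a plain HTTP client and the HTTP methods designed for this: HEAD to find out whether a file has changed and, if so, GET to fetch it. You can give your server repository some layout: download an index file for that particular client, then check every file in that index. Take inspiration from Debian APT repositories (diffs, file signatures, etc.) if that fits your use case. WebDAV is another option for accessing the server and offers a more comfortable experience. You did not mention authentication, which may be needed. If the clients use HTTP, you can use (caching) proxies.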
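
As a sketch of the index-file idea, the following Python compares a server index against what the client already has and decides what to fetch or delete. The index format (one "digest  path" pair per line, similar to sha1sum output) and the function names are assumptions; the answer above does not prescribe a format. The actual transfer would be plain HTTP GET requests for the paths in the fetch list.

```python
from typing import Dict, List, Tuple


def parse_index(text: str) -> Dict[str, str]:
    """Parse lines of '<hex digest>  <path>' into a path -> digest mapping."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(None, 1)   # split on the first whitespace run
        index[path.strip()] = digest
    return index


def plan_update(server: Dict[str, str],
                local: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (paths to download, paths to delete) for one client."""
    to_fetch = sorted(p for p, d in server.items() if local.get(p) != d)
    to_delete = sorted(p for p in local if p not in server)
    return to_fetch, to_delete
```

A client that keeps its previous index only needs to download the new index and the files in the fetch list on each daily check.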

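You can keep the data (the tree served by the HTTP server) in a git repository. Replace the binaries with small files containing their hashes (as Klaas van Schelven suggested), and you can even add other metadata such as changelogs, timestamps, or file authors.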
