How to store edit history effectively?
I was just wondering about sites like Stack Overflow and Wikipedia: they store edit history indefinitely and allow users to roll back edits. Can someone recommend any resources/books/articles on how to do this with any suitable technology (such as databases)?
Thanks a lot!
There are a number of options, the simplest, of course, being simply to record all versions independently. For a site like Stack Overflow, where posts usually aren't edited very many times, this is appropriate. However, for something like Wikipedia, one needs to be more clever to save space.
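For the simple approach, the schema can be as small as a single revisions table holding the complete body of every version, and a rollback is just re-inserting an old body as a new revision. Here is a minimal sketch using SQLite; the table and column names are made up for illustration and are not Stack Overflow's actual schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE post_revisions (
        id        INTEGER PRIMARY KEY,
        post_id   INTEGER NOT NULL,
        revision  INTEGER NOT NULL,              -- 1, 2, 3, ... per post
        body      TEXT NOT NULL,                 -- full text of this version
        edited_at TEXT DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (post_id, revision)
    )
""")

def save_edit(post_id, new_body):
    """Store the complete new body as the next revision of the post."""
    (next_rev,) = conn.execute(
        "SELECT COALESCE(MAX(revision), 0) + 1 FROM post_revisions WHERE post_id = ?",
        (post_id,)).fetchone()
    conn.execute(
        "INSERT INTO post_revisions (post_id, revision, body) VALUES (?, ?, ?)",
        (post_id, next_rev, new_body))
    conn.commit()
    return next_rev

def rollback(post_id, revision):
    """Rolling back simply re-saves an old body as a brand-new revision."""
    (old_body,) = conn.execute(
        "SELECT body FROM post_revisions WHERE post_id = ? AND revision = ?",
        (post_id, revision)).fetchone()
    return save_edit(post_id, old_body)

save_edit(1, "First draft of the post.")
save_edit(1, "First draft of the post, now edited.")
rollback(1, 1)   # creates revision 3 with the same body as revision 1

Storage grows linearly with the number of edits, but reading or rolling back any revision is a single lookup, which is why this is fine for posts that are rarely edited.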
In the case of Wikipedia, pages are initially stored with each version separate, in the text table. Periodically, a number of older revisions are compressed together, then packed into a single field. Since there will be a lot of repetition, you save a lot of space this way.
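The saving comes from the fact that consecutive revisions of a page are nearly identical, so compressing a batch of them as one blob squeezes out the repetition. A toy illustration of the principle (not MediaWiki's actual storage code):

import json
import zlib

def pack_revisions(revisions):
    """revisions: list of full-text strings, oldest first.
    Returns one compressed blob that can be stored in a single field."""
    return zlib.compress(json.dumps(revisions).encode("utf-8"), 9)

def unpack_revision(blob, index):
    """Decompress the blob and pull out a single revision by index."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))[index]

# Consecutive revisions differ only slightly, so compressing them together
# is much smaller than compressing each revision on its own.
revs = ["Some article text." + " A sentence added later." * i for i in range(50)]
together = len(pack_revisions(revs))
separately = sum(len(zlib.compress(r.encode("utf-8"), 9)) for r in revs)
print(together, "bytes packed together vs", separately, "bytes stored separately")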
You might also want to look into how some version control systems do it. For example, Subversion uses skip deltas, where revisions are stored as a difference from a revision partway back in the history; this means that one has to examine at most O(log n) revisions to reconstruct the revision of interest.
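To make the O(log n) bound concrete, here is one simple rule for choosing skip-delta bases, clearing the lowest set bit of the revision number. It illustrates the scheme rather than reproducing Subversion's exact implementation:

def delta_base(n):
    """Revision that revision n is stored as a delta against."""
    return n & (n - 1)            # clear the lowest set bit

def chain(n):
    """All revisions that must be read to reconstruct revision n."""
    hops = [n]
    while n > 0:
        n = delta_base(n)
        hops.append(n)
    return hops

# Reconstructing revision 1,000,000 touches only 8 revisions, not a million:
print(chain(1_000_000))
# [1000000, 999936, 999424, 983040, 917504, 786432, 524288, 0]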
Git, on the other hand, uses something more similar to Wikipedia's approach.
Revisions are first stored as individually compressed 'loose' objects; periodically, git takes all of the loose objects, sorts them according to a somewhat complex heuristic, builds compressed deltas between 'nearby' objects, and dumps the result as a packfile.
The number of revisions that need to be read to reconstruct a file is bounded by an argument to the pack building process. This has the interesting property that deltas can be built between objects that are unrelated, in some cases.
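Here is a toy sketch of that loose-object/packfile life cycle. The size-based sort and the fixed depth limit below are stand-ins for git's real heuristics and its depth parameter, and the 'delta' is approximated by compressing each object with its neighbour as a zlib preset dictionary, which gives the same repetition-removing effect.

import hashlib
import zlib

MAX_DEPTH = 10                    # cap on delta-chain length

loose = {}                        # object id -> individually compressed bytes

def store_loose(data):
    """Phase one: every object is compressed on its own ('loose')."""
    oid = hashlib.sha1(data).hexdigest()
    loose[oid] = zlib.compress(data)
    return oid

def repack(max_depth=MAX_DEPTH):
    """Phase two: sort objects (here just by size, a stand-in for git's
    heuristic) and delta each one against its neighbour in that order,
    unless the chain would get too deep."""
    objs = sorted(((oid, zlib.decompress(blob)) for oid, blob in loose.items()),
                  key=lambda item: len(item[1]))
    pack, depth = {}, {}
    prev_oid, prev_data = None, None
    for oid, data in objs:
        if prev_data is not None and depth[prev_oid] < max_depth:
            # Using the neighbour as a preset dictionary means repeated
            # text costs almost nothing -- the delta effect.
            comp = zlib.compressobj(zdict=prev_data)
            pack[oid] = ("delta", prev_oid, comp.compress(data) + comp.flush())
            depth[oid] = depth[prev_oid] + 1
        else:
            pack[oid] = ("full", None, zlib.compress(data))
            depth[oid] = 0
        prev_oid, prev_data = oid, data
    return pack

def read_object(pack, oid):
    """Reading walks at most max_depth delta bases before hitting a full object."""
    kind, base, blob = pack[oid]
    if kind == "full":
        return zlib.decompress(blob)
    decomp = zlib.decompressobj(zdict=read_object(pack, base))
    return decomp.decompress(blob)

a = store_loose(b"line one\nline two\n" * 100)
b = store_loose(b"line one\nline two\n" * 100 + b"line three\n")
pack = repack()
assert read_object(pack, b) == b"line one\nline two\n" * 100 + b"line three\n"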