Version-controlling zipped files (docx, odt)

Posted 2024-09-24 02:14:37

There are formats that are actually zip files in disguise, e.g. docx or odt. If I store them directly in version control, they are handled as binary files. My ideal solution would be

have a hook that creates a foo.docx/ directory for each foo.docx file before commit, unzipping all files into it
optionally, have a hook that reindents the xml files
have a hook that recreates foo.docx from the stored files after update

I don't want the docx files themselves to be version-controlled. (I am aware of a related question where a different approach with a custom diff was suggested.)

Is this doable? Is this doable with Mercurial?

UPDATE:

I know about hooks. I am interested in the specifics. Here is a session to demonstrate the expected behavior.

> hg add foo.docx
> hg status
A foo.docx
> hg commit
> # Change foo.docx with external editor
> hg status
M foo.docx
> hg diff
+++ foo.docx/word/document.xml
- <w:t>An idea</w:t>
+ <w:t>A much better idea</w:t>

单挑你×的.吻 2024-10-01 02:14:37

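I was wondering the same thing and just came across the ZipDoc extension/filter for Mercurial, which appears to do exactly this!

I haven't tried it yet, but it looks promising!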

So尛奶瓶 2024-10-01 02:14:37

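If you can get over the hurdle of successfully unzipping and re-zipping the OpenOffice documents, then you should be able to use the filter system. This lets you transform files each time they are read from or written to the repository.

Unfortunately, you have to do more than just unzip the foo.docx file. The problem is that a filter must produce a single file as output, so perhaps you can unzip foo.docx and then tar the resulting files. You would then be versioning the tarball, which should work because a tarball is just an uncompressed concatenation of all the individual files plus some meta-information. Come to think of it, a simpler solution would be to zip the unzipped foo.docx contents again, but with compression turned off. That should give a result similar to using tar.

Solving this is something I have wanted to do myself, so please report back by sending a mail to the Mercurial mailing list.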
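The "zip again without compression" idea can be sketched in Python roughly as follows. This is only an illustration under assumptions, not something taken from Mercurial itself: the function name rezip_stored is made up here, and how it would be hooked into the filter system (the [encode] and [decode] sections of hgrc) is left open.

```python
import io
import zipfile


def rezip_stored(data: bytes) -> bytes:
    """Return a copy of the zip archive in `data` with every member stored uncompressed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # Reuse the original name and timestamp so that members which did
            # not change produce the same bytes from one revision to the next.
            new_info = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            new_info.compress_type = zipfile.ZIP_STORED
            new_info.external_attr = info.external_attr
            dst.writestr(new_info, src.read(info))
    return out.getvalue()
```

Because the members are stored rather than deflated, the XML text appears verbatim in the output, which is what gives the repository's delta storage something to work with. Reading the archive from stdin and writing the result to stdout would presumably be enough to use this as a pipe-style filter.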

信仰 2024-10-01 02:14:37

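You can use a precommit hook to unzip and an update hook to zip. See the definitive guide for how to use hooks.

Be careful with renames. If you rename foo.docx to bar.docx, your precommit hook will need to delete foo.docx/ and add bar.docx/.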
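As a rough sketch of the two helpers such hooks might call, assuming Python is available: unpack would run from the precommit hook and repack from the update hook. The function names and the idea of a separate target directory are hypothetical, and the sketch does not handle renames or ODT's requirement that the mimetype entry comes first and is stored uncompressed.

```python
import zipfile
from pathlib import Path


def unpack(archive: Path, target: Path) -> None:
    """Extract every member of `archive` into the directory `target`."""
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)


def repack(source: Path, archive: Path) -> None:
    """Rebuild `archive` from the files under `source`, in sorted order."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(p for p in source.rglob("*") if p.is_file()):
            # Store paths relative to the directory, with forward slashes.
            zf.write(path, path.relative_to(source).as_posix())
```

Note that a file foo.docx and a directory foo.docx/ cannot coexist under the same name, so in practice either the packed file has to be moved out of the way or the directory needs a different name.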

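Update (sorry for giving an entry-level answer to a 1k-rep user)

If you want core hg operations such as diff to work on the unpacked docx (status can work with the packed file), you will have to write an extension. I think you could take an approach similar to the keyword extension and wrap the repo object with your own.

I have written a few extensions, but nothing that reaches down to the core level, so I cannot give more details.

If you want to go crazy, you could even merge using the unzipped files. But consider treating the file as binary and using an external tool for diff and merge.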

徒留西风 2024-10-01 02:14:37

I've been struggling with this exact problem for the last few days and have written a small .NET utility to extract and normalise Excel files in such a way that they're much easier to store in source control. I've published the executable here:

https://bitbucket.org/htilabs/ooxmlunpack/downloads/OoXmlUnpack.exe

...and the source here:

https://bitbucket.org/htilabs/ooxmlunpack

If there's any interest I'm happy to make this more configurable, but at the moment, you should put the executable in a folder (e.g. the root of your source repository) and when you run it, it will:

Scan the folder and its subfolders for any .xlsx and .xlsm files
Take a copy of the file as *.orig
Unzip each file and re-zip it with no compression
Pretty-print any files in the archive which are valid XML
Delete the calcchain.xml file from the archive (since it changes a lot and doesn't affect the content of the file)
Inline any unformatted text values (otherwise these are kept in a lookup table which causes big changes in the internal XML if even a single cell is modified)
Delete the values from any cells which contain formulas (since they can just be calculated when the sheet is next opened)
Create a subfolder *.extracted, containing the extracted zip archive contents

Clearly not all of these things are necessary, but the end result is a spreadsheet file that will still open in Excel but which is much more amenable to diffing and incremental compression. Also, storing the extracted files as well makes it much more obvious in the version history what changes have been applied in each version.
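To illustrate just the pretty-printing step from the list above, here is a minimal Python sketch. It is not the tool's actual (.NET) implementation; the function name pretty_print_if_xml is made up for this example, and it assumes that re-indenting does not disturb whitespace the application cares about.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def pretty_print_if_xml(data: bytes) -> bytes:
    """Return an indented copy of `data` if it parses as XML, else `data` unchanged."""
    try:
        document = minidom.parseString(data)
    except ExpatError:
        # Not well-formed XML (e.g. an embedded image): leave it alone.
        return data
    return document.toprettyxml(indent="  ", encoding="utf-8")
```

Applied to each archive member before re-zipping, this puts each element on its own line, so a line-based diff shows the changed element rather than one very long changed line.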

If there's any appetite out there, I'm happy to make the tool more configurable since I guess not everyone will want the contents extracted, or possibly the values removed from formula cells, but these are both very useful to me at the moment.

In tests, a 2MB spreadsheet 'unpacks' to 21MB, but I was then able to store five versions of it, with small changes between each, in a 1.9MB Mercurial data file, and to visualise the differences between versions effectively using Beyond Compare in text mode.
