What effect does an "XML roundtrip" have on a Word 2003 document?
Saving a Word 2003 document as XML and then back to .doc reduces the file size, and probably does more that I don't know about. A diff of the new document's WordML against the old one's shows differences only in the revision save IDs. So what is getting lost in the roundtrip?
If nothing is actually getting lost, how would one explain the few thousand bytes shaved off the file size?
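For anyone who wants to repeat the comparison, here is a minimal Python sketch of a WordML diff that ignores revision save IDs. It assumes the IDs appear as attributes whose local name starts with "rsid" (such as rsidR); the function names are hypothetical, and any revision IDs stored as element content would still show up in the output.

```python
import difflib
import xml.etree.ElementTree as ET


def local_name(qualified: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on names."""
    return qualified.rsplit("}", 1)[-1]


def strip_rsids(xml_data) -> str:
    """Return the XML re-serialized without rsid* attributes.

    Assumption: revision save IDs are attributes whose local name
    starts with 'rsid'. Accepts str or bytes.
    """
    root = ET.fromstring(xml_data)
    for element in root.iter():
        for key in list(element.attrib):
            if local_name(key).startswith("rsid"):
                del element.attrib[key]
    # Word tends to write very long lines; indent (Python 3.9+) puts
    # elements on separate lines so a line-based diff is usable.
    ET.indent(root)
    return ET.tostring(root, encoding="unicode")


def diff_ignoring_rsids(old_xml, new_xml) -> list[str]:
    """Unified diff of two WordML documents with rsid attributes removed."""
    old_lines = strip_rsids(old_xml).splitlines()
    new_lines = strip_rsids(new_xml).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    )
```

Reading both files with `Path(...).read_bytes()` and passing the bytes in avoids encoding guesswork. An empty result means the two documents differ only in the stripped attributes.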
Comments (4)
The following is just a guess.
A .doc file is actually an OLE structured storage compound file (http://www.forensicswiki.org/images/5/5b/Compdocfileformat.pdf). That format packs multiple streams into a single document in a well-defined way, and its structure is pretty close to a filesystem-in-a-file: for example, it has "sectors" and a sector allocation table. This makes it possible to edit a document file in place without rewriting it completely.
However, this storage approach results in some redundancy, such as unused sectors. When you roundtrip the file, you effectively recreate it from scratch, so any such redundant storage artefacts are eliminated.
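To check the guess, one could count the sectors that the allocation table marks as free. The Python sketch below is only an illustration: the function name is hypothetical, the header offsets follow the published compound file layout, and it reads only the FAT sectors listed in the header, so very large files whose table continues in extra DIFAT sectors are not fully covered.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
FREESECT = 0xFFFFFFFF  # allocation-table marker for an unused sector


def count_free_sectors(data: bytes) -> tuple[int, int]:
    """Return (free sector count, bytes they occupy) for a compound file.

    Sketch only: follows the 109 DIFAT entries stored in the header and
    ignores any additional DIFAT sectors.
    """
    if data[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE compound file")

    (sector_shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << sector_shift
    (num_fat_sectors,) = struct.unpack_from("<I", data, 44)
    difat = struct.unpack_from("<109I", data, 76)

    # Sector n starts at byte offset (n + 1) * sector_size, because the
    # header occupies the first sector-sized block of the file.
    entries_per_sector = sector_size // 4
    fat = []
    for location in difat[:num_fat_sectors]:
        if location == FREESECT:
            continue
        offset = (location + 1) * sector_size
        fat.extend(struct.unpack_from(f"<{entries_per_sector}I", data, offset))

    # Only entries for sectors physically present in the file count as waste.
    sectors_in_file = max(0, (len(data) - sector_size) // sector_size)
    free = sum(1 for entry in fat[:sectors_in_file] if entry == FREESECT)
    return free, free * sector_size
```

Running it on the original and the round-tripped .doc (for example `count_free_sectors(open(path, "rb").read())`) would show whether free sectors account for the size difference.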
As far as I know, Word stores some information in DOC files in addition to text and formatting, for example user information and some of the document history. This information accumulates when you use "File > Save". I suppose that saving as XML and re-saving as DOC strips that information.
If I recall correctly, a simple "Save As" already reduces the file size, and I think there used to be a menu item that let you save a version of the DOC file that was significantly smaller than the "File > Save" version.
If you look at a Word document (.doc) in a hex editor, you will see that there are many, many blocks of redundant zeroes. Great format, doc!
Anyway, saving to XML and then back to doc might get rid of some of those thousands of zero bytes.
If you're really curious, just open both files in a hex editor and run a difference algorithm. You can try Hex Workshop or Hex Editor Neo.
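If no hex editor is at hand, a rough Python sketch can total the zero padding in each file. The function name and the 64-byte threshold are arbitrary choices for illustration, not something the file format defines.

```python
import re


def summarize_zero_runs(data: bytes, min_len: int = 64) -> tuple[int, int]:
    """Return (number of runs, total bytes) for runs of at least
    min_len consecutive zero bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    lengths = [len(match.group()) for match in pattern.finditer(data)]
    return len(lengths), sum(lengths)
```

Comparing the totals for the original and the round-tripped file gives a quick idea of how much of the size difference is zero padding.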
My experiments with a few large Word 2003 documents show that saving as XML, then saving that as .doc, does result in a slightly, though not significantly, smaller file. As you point out, the rsidR attributes are different, but that does not account for the reduction in size, since the new rsidRs are typically the same size.
As Danra points out, .doc files have runs of identical bytes. But the smaller file saved as .doc also has such runs, so I believe this is an artifact of the .doc binary format and not information-carrying data. I eyeballed a few of the round-tripped .doc files and could see no difference in appearance at all, which supports the idea that the differences are not information-carrying.
Examining the XML files created after round-tripping shows that the main difference is that several rPr (run properties) elements with no content are removed after converting to XML. It seems that saving as XML removes unused character styles and properties.
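A rough way to reproduce that observation is to count the empty rPr elements in each XML file. This Python sketch matches on the local element name only, so it assumes nothing about the exact namespace URI; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_empty_rpr(xml_data) -> int:
    """Count rPr elements with no children, attributes, or text.

    Accepts str or bytes containing a WordML document.
    """
    root = ET.fromstring(xml_data)
    count = 0
    for element in root.iter():
        name = element.tag.rsplit("}", 1)[-1]  # strip '{namespace}'
        is_empty = (
            len(element) == 0
            and not element.attrib
            and not (element.text or "").strip()
        )
        if name == "rPr" and is_empty:
            count += 1
    return count
```

A higher count in the XML saved before the roundtrip than in the XML saved after it would be consistent with the empty run properties being dropped.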