Document validation/comparison
Is there a way to compare two docx documents?
I have one that is generated from a template document, where some sections are removed dynamically through bookmarks and block sections in the template.
I would like to compare the generated document with another docx that represents the expected result.
I have vaguely heard of checksum comparison.
Does anybody have pointers on the best way to compare two documents?
Thanks
Comments (2)
You could use XMLUnit for .NET to compare the main document parts (document.xml).
You could get the main document parts using the OpenXML SDK or System.IO.Packaging. See "C# to replace strings of text in a docx" for more on the latter approach.
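As a rough illustration of this idea, here is a minimal sketch that pulls the main document part out of each package and compares the XML trees. It makes several assumptions not stated in the answer: it reads the package as a plain ZIP archive with System.IO.Compression instead of System.IO.Packaging, it assumes the main part lives at the conventional path word/document.xml (strictly, the package relationships decide that), and it swaps in LINQ to XML's XNode.DeepEquals in place of XMLUnit. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxXmlComparer
{
    // Loads the main document part from a .docx package, which is a ZIP archive.
    // Assumption: the part is stored at the conventional path word/document.xml.
    public static XDocument LoadMainPart(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream part = entry.Open())
                return XDocument.Load(part);
        }
    }

    // Returns true when both main parts parse to structurally equal XML trees.
    public static bool MainPartsEqual(Stream docxA, Stream docxB)
    {
        return XNode.DeepEquals(LoadMainPart(docxA), LoadMainPart(docxB));
    }
}
```

XNode.DeepEquals is a strict comparison: any difference in element names, attribute values, or text makes it return false, so revision-tracking attributes that Word writes on save would count as differences. A dedicated XML diff library such as XMLUnit is likely a better fit if the comparison needs to ignore some of those.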
Checksums work well for checking byte-by-byte exactness. If that's what you are looking for, then read the bytes of each document into a stream and use a SHA256Managed or MD5CryptoServiceProvider to generate a checksum for each file. If the two checksums are the same, then the two documents are most likely the same. MD5 is not suitable for security purposes (http://en.wikipedia.org/wiki/MD5 - see "Security"), but it should be fine for comparison purposes where you are in control of both documents. Also keep in mind that checksums are not 100% unique, so there is always the remote possibility of a collision.
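A minimal sketch of the whole-file checksum approach follows. It uses the SHA256.Create() factory from System.Security.Cryptography, which stands in for constructing SHA256Managed directly; the class and method names are hypothetical. One caveat worth keeping in mind: a .docx file is a ZIP archive, and ZIP entries carry timestamps and compression details, so two documents with the same visible content can still differ byte for byte.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileChecksum
{
    // Computes the SHA-256 digest of a stream as a lowercase hex string.
    public static string Sha256Hex(Stream input)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(input);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Returns true when both files hash to the same digest,
    // which means they are almost certainly byte-for-byte identical.
    public static bool SameBytes(string pathA, string pathB)
    {
        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
            return Sha256Hex(a) == Sha256Hex(b);
    }
}
```

For example, `FileChecksum.SameBytes("generated.docx", "expected.docx")` would compare two files on disk; both file names here are placeholders.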
However, if you are comparing section by section, then you may need to open the document as more than raw bytes and deal with it in a structured fashion, e.g. section by section. You can open a .docx file programmatically in C# by a variety of means; perhaps you can then compute a checksum over the contents of each section?
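One possible reading of this suggestion, sketched under stated assumptions: treat each paragraph (a w:p element in the main document part) as a "section", hash its visible text, and report the first position where the two lists of hashes diverge. The answer does not say what a section is, so the paragraph granularity is an assumption, as are the class and method names. The input is the main part already loaded as an XDocument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

public static class ParagraphChecksums
{
    // WordprocessingML main namespace, used by w:p and w:t elements.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one SHA-256 digest (Base64) per paragraph, based on its text runs only.
    public static List<string> HashParagraphs(XDocument mainPart)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return mainPart.Descendants(W + "p")
                .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                .Select(text => Convert.ToBase64String(
                    sha.ComputeHash(Encoding.UTF8.GetBytes(text))))
                .ToList(); // evaluate before the hash object is disposed
        }
    }

    // Index of the first paragraph whose digest differs, or -1 if the lists match.
    // If one list is a prefix of the other, returns the length of the shorter list.
    public static int FirstDifference(IList<string> expected, IList<string> actual)
    {
        int common = Math.Min(expected.Count, actual.Count);
        for (int i = 0; i < common; i++)
            if (expected[i] != actual[i]) return i;
        return expected.Count == actual.Count ? -1 : common;
    }
}
```

Because only the text inside w:t elements is hashed, formatting changes go unnoticed. That may or may not be what is wanted when checking a generated document against an expected one.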
This thread talks about creating and manipulating .docx files using C#: "How can a Word document be created in C#?" The same tools could be used to read one.