I need to be able to identify that a given file is an ODF file based on the contents of the file, and not on the file's extension.
ODF files are really a collection of XML files in a zip container, which means that I cannot use the file's magic number as it will just indicate that it is a zip file.
So what I'm really asking is: are there any files that are required to be present in an ODF container? If so, the presence of that file in a zip container indicates that it is likely to be an ODF file, and its absence indicates that it definitely is not an ODF file.
Comments (3)
Why not check out the ODF Technical Specification? The mimetype file listed there would probably be an ideal way to check (just look for the vnd.oasis.opendocument string in the mimetype file).
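A minimal sketch of that mimetype check in Python, using only the standard-library zipfile module. The function name looks_like_odf and the exact prefix constant are illustrative assumptions, not something taken from the specification text quoted here; whether the mimetype entry is strictly mandatory should be confirmed against the ODF version you care about.

```python
import zipfile

# Assumption: every ODF media type starts with this prefix
# (e.g. application/vnd.oasis.opendocument.text).
ODF_MIME_PREFIX = b"application/vnd.oasis.opendocument"


def looks_like_odf(path):
    """Return True if `path` is a zip archive whose `mimetype` entry names an ODF type."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            try:
                mimetype = archive.read("mimetype")
            except KeyError:
                # No entry called "mimetype" in the archive.
                return False
    except zipfile.BadZipFile:
        return False
    return mimetype.strip().startswith(ODF_MIME_PREFIX)
```

Called as looks_like_odf("report.odt"), this returns a boolean without looking at the file extension. A plain zip archive with no mimetype entry is rejected, which matches the "absence means not ODF" rule from the question only if the entry really is always present.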
As I understand it, there will always be .xml file(s) in the root of the archive, and this/these xml files will always contain the string <office:document very near the beginning.
All those I have seen seem to contain a file called "content.xml" in the root, which does contain this string.
There are not so many applications writing ODF documents, and in the past, there was basically just one. So it shouldn't be too difficult to install some ancient version of OpenOffice, save a few files, and check that this rule applies as it does on current ODF files.
I would test with something like this on a batch of known ODF files, to check if it is reliable:
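One way such a batch test could look, as a rough Python sketch rather than the answerer's own snippet. The helper names, the 4096-byte window standing in for "very near the beginning", and the assumption that the XML is stored in an ASCII-compatible encoding such as UTF-8 are all illustrative choices.

```python
import zipfile

MARKER = b"<office:document"
# Assumption: how many leading bytes count as "very near the beginning".
HEAD_BYTES = 4096


def has_office_document_marker(path):
    """Return True if any root-level .xml entry contains MARKER near its start."""
    try:
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                # Skip entries in subdirectories and non-XML entries.
                if "/" in name or not name.lower().endswith(".xml"):
                    continue
                with archive.open(name) as entry:
                    if MARKER in entry.read(HEAD_BYTES):
                        return True
    except (zipfile.BadZipFile, OSError):
        # Not a zip archive, or the file could not be read.
        return False
    return False


def failing_files(paths):
    """Return the paths that do NOT pass the marker check."""
    return [p for p in paths if not has_office_document_marker(p)]
```

Running failing_files over a directory of known ODF documents and getting back an empty list would support the rule; any path it returns is a counterexample worth inspecting by hand.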
Read the Build ID - if missing, the document is not ODF.