Getting the MimeType subtype with Apache Tika
I need to get the iana.org MediaType, rather than application/zip or application/x-tika-msoffice, for documents such as odt, ppt, pptx, xlsx, etc.
If you look at mimetypes.xml, there are mime-type elements composed of the iana.org mime-type and a "sub-class-of" element:
<mime-type type="application/msword">
<alias type="application/vnd.ms-word"/>
............................
<glob pattern="*.doc"/>
<glob pattern="*.dot"/>
<sub-class-of type="application/x-tika-msoffice"/>
</mime-type>
How do I get the iana.org mime-type name instead of the parent type name?
When testing mime type detection, I do:
MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();
Test results:
FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>
FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>
FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>
FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>
FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>
Is there any way to get the actual subtype from mimetypes.xml, instead of x-tika-msoffice or application/zip?
Moreover, I never get application/x-tika-ooxml; I get application/zip for xlsx, docx, and pptx documents.
Comments (4)
Originally, Tika only supported detection by mime magic or by file extension (glob), as this is all that most mime detection before Tika did.
Because of the problems with mime magic and globs when it comes to detecting container formats, it was decided to add some new detectors to Tika to handle these. The container-aware detectors took the whole file, opened and processed the container, and then worked out the exact file type based on the contents. Initially, you needed to call them explicitly, but then they were wrapped up in ContainerAwareDetector, which you'll see in some of the answers.
Since then, Tika has added a service loader pattern, initially for Parsers. This allowed classes to be auto-loaded when present, with a general way to identify which ones were appropriate and use those. This support was then extended to cover Detectors too, at which point the old ContainerAwareDetector could be removed in favour of something cleaner.
If you're on Tika 1.2 or later, and you want accurate detection of all formats, including container formats, you want to do something like:
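A minimal sketch of such a call, assuming Tika 1.x with both tika-core and tika-parsers on the classpath. The class name DetectExample, the method name detectType, and the use of a file path as input are illustrative choices, not taken from the answer:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

class DetectExample {

    // Detects the media type of a file from its content plus its name.
    static MediaType detectType(Path file) throws IOException {
        // The default configuration hands back a detector that combines
        // whatever Detector implementations are present on the classpath.
        Detector detector = TikaConfig.getDefaultConfig().getDetector();

        // The file name is only a hint; container detectors look at the bytes.
        // Tika 1.x constant; Tika 2.x keeps it in TikaCoreProperties instead.
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, file.getFileName().toString());

        // Wrapping the stream in a TikaInputStream lets container detectors
        // read the whole file rather than only the leading bytes.
        try (InputStream raw = Files.newInputStream(file);
             TikaInputStream stream = TikaInputStream.get(raw)) {
            return detector.detect(stream, metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(detectType(Paths.get(args[0])));
    }
}
```

Calling toString() on the returned MediaType gives the full type/subtype string, whereas getSubtype() gives only the part after the slash.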
If you run this with only the Core Tika jar (tika-core-1.2-....), then the only detector present will be the mime magic one, and you'll get the old-style detection based on magic + glob only. However, if you run this with both the Core and Parser Tika jars (plus their dependencies), or from Tika App (which includes core + parsers + dependencies automatically), then the DefaultDetector will use all the various container detectors to process your file. If your file is zip based, then detection will include processing the zip structure to identify the file type based on what's in there. This will give you the high-accuracy detection you're after, without needing to call lots of different parsers in turn. DefaultDetector will use all Detectors that are available.
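To illustrate why zip-based files need their container opened rather than only their leading bytes checked, here is a toy sketch that uses nothing but java.util.zip. It is not Tika's implementation: the class name ZipPeek, the methods guessType and zipOf, and the folder-prefix heuristic (word/, xl/, ppt/ inside an OOXML package) are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipPeek {

    private static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    // Guesses a media type from the entry names inside a zip container.
    // Hypothetical heuristic only; a real detector checks much more.
    static String guessType(InputStream in) throws IOException {
        boolean sawEntry = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                sawEntry = true;
                String name = entry.getName();
                if (name.startsWith("word/")) return OOXML + "wordprocessingml.document";
                if (name.startsWith("xl/"))   return OOXML + "spreadsheetml.sheet";
                if (name.startsWith("ppt/"))  return OOXML + "presentationml.presentation";
            }
        }
        // Every zip looks the same from the outside; only the entries differ.
        return sawEntry ? "application/zip" : "application/octet-stream";
    }

    // Builds a tiny in-memory zip holding the given entry names.
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("x".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docxLike = zipOf("[Content_Types].xml", "word/document.xml");
        byte[] plainZip = zipOf("notes.txt");
        System.out.println(guessType(new ByteArrayInputStream(docxLike)));
        System.out.println(guessType(new ByteArrayInputStream(plainZip)));
    }
}
```

Both byte arrays begin with the same zip signature, which is why detection based on magic bytes alone reports application/zip for docx, xlsx, and pptx files.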
For anyone else having a similar problem but using a newer Tika version, this should do the trick:
Use ZipContainerDetector, since you may not have ContainerAwareDetector any more.
Pass a TikaInputStream to the detect() method of the detector to ensure Tika can analyze the correct mime type.
My example code looks like this:
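A minimal sketch of those two steps, assuming Tika 1.x with tika-parsers on the classpath. It is not the answerer's code: Document, getDocName(), and getData() are hypothetical stand-ins for a domain class, and DefaultDetector is used in place of a directly constructed ZipContainerDetector because it picks up whichever container detectors are available.

```java
import java.io.IOException;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

// Hypothetical domain class: a file name plus the raw file bytes.
class Document {
    private final String docName;
    private final byte[] data;

    Document(String docName, byte[] data) {
        this.docName = docName;
        this.data = data;
    }

    String getDocName() { return docName; }
    byte[] getData() { return data; }
}

class DocumentMimeTypes {

    // Returns the full "type/subtype" string for the given document.
    static String getMimeType(Document document) throws IOException {
        Metadata metadata = new Metadata();
        // Tika 1.x constant; Tika 2.x keeps it in TikaCoreProperties instead.
        metadata.set(Metadata.RESOURCE_NAME_KEY, document.getDocName());

        // Loads every Detector registered on the classpath.
        Detector detector = new DefaultDetector();

        // A TikaInputStream gives container detectors access to the whole file.
        try (TikaInputStream stream = TikaInputStream.get(document.getData(), metadata)) {
            return detector.detect(stream, metadata).toString();
        }
    }
}
```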
Note that the Document class is part of my domain model, so you will surely have something similar at that line. I hope that someone can use this.
The default byte pattern detection rules in tika-core can only detect the generic OLE2 or ZIP format used by all MS Office document types. As far as I know, you want to use ContainerAwareDetector for this kind of detection, with the MimeTypes detector as its fallback detector. Try this:
This way your tests should pass.
You can use a custom Tika config file:
Put "tika-custom-MimeTypes.xml", with your changes, in WEB-INF/classes:
In my case:
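A minimal sketch of how such a file might be loaded and used for detection, assuming Tika 1.x and that tika-custom-MimeTypes.xml is reachable on the classpath (as it would be under WEB-INF/classes). The class and method names are illustrative, and the contents of the XML file are not reconstructed here.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.mime.MimeTypesFactory;

class CustomMimeDetection {

    // Detects a file's type using the custom mime type definitions.
    static String detect(File file) throws IOException, MimeTypeException {
        URL definitions = Thread.currentThread()
                .getContextClassLoader()
                .getResource("tika-custom-MimeTypes.xml");
        if (definitions == null) {
            throw new IOException("tika-custom-MimeTypes.xml not found on classpath");
        }

        // Build the mime type registry from the custom definitions only.
        MimeTypes mimes = MimeTypesFactory.create(definitions);

        // Tika 1.x constant; Tika 2.x keeps it in TikaCoreProperties instead.
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        try (TikaInputStream stream = TikaInputStream.get(file)) {
            return new DefaultDetector(mimes).detect(stream, metadata).toString();
        }
    }
}
```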