Alfresco Community 4.0 does not recognise the DITA file mimetype
I've installed Alfresco Community 4.0.a and extended the mimetype list using mimetype-map.xml, as I did before in 3.4:
<alfresco-config area="mimetype-map">
  <config evaluator="string-compare" condition="Mimetype Map">
    <mimetypes>
      <mimetype mimetype="application/dita+xml" text="true" display="DITA">
        <extension default="true" display="DITA Topic">dita</extension>
        <extension default="true" display="DITA Map">ditamap</extension>
        <extension default="true" display="DITA Conditional Processing Profile">ditaval</extension>
      </mimetype>
      etc...
But each time I import a DITA file, it is recognised either as an XML file or as plain text. I've dug into it, and it looks like this is because Apache Tika analyzes the beginning of the file to determine its mimetype.
How do I bypass Tika with my custom mimetype-map? (From the code, it looks as though Tika is triggered first, and if it finds something then it's game over.)
Do I have to extend Tika by writing my own parser?
The mimetype matching logic in 4.0 has changed slightly, now that the content is available for detection rather than just the filename. As part of this, if Tika is very sure about what a file is, then its answer is preferred.
In most cases, this means that for common but incorrectly named files, Tika can help correct mistakes. For non-standard files, Tika will decline to offer a strong suggestion, and the Alfresco name-based matching will be used as before. (Where Tika and Alfresco differ on the canonical form of a mimetype, the Alfresco version is preferred.)
There are a small number of cases where the file type is actually a specialisation of a common type, and Tika knows about the parent type but not the specific one. In this case, Tika strongly suggests the parent type, and we have no way to realise that the new type added to Alfresco is based on it. (Tika has a hierarchy of mimetypes, while Alfresco just has a flat list.) For this small number of cases, Tika needs guiding too.
The usual fix is to report a Tika bug and have the file type added upstream. (For very custom types, you need to add a Tika custom-mimetypes.xml too, which defines the hierarchy and the glob.)
In this DITA case, I've opened TIKA-784 and added a provisional fix. This has now gone into Alfresco too.
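For a type that Tika does not yet know about, a custom-mimetypes.xml along the following lines may be enough to declare the hierarchy and the globs. This is only a rough sketch, not the fix that went into TIKA-784: it assumes the file is placed on the classpath at org/apache/tika/mime/custom-mimetypes.xml and that the Tika version in use picks it up from there, and the choice of application/xml as the parent type is an assumption based on DITA files being detected as XML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; the parent type and globs are assumptions -->
<mime-info>
  <mime-type type="application/dita+xml">
    <!-- Hierarchy: tells Tika that DITA is a specialisation of XML -->
    <sub-class-of type="application/xml"/>
    <_comment>DITA</_comment>
    <!-- Globs: filename patterns that select the specific type -->
    <glob pattern="*.dita"/>
    <glob pattern="*.ditamap"/>
    <glob pattern="*.ditaval"/>
  </mime-type>
</mime-info>
```

With the parent relationship declared, a file that content detection identifies as XML but whose name matches one of the globs can be resolved to the more specific type, which should then line up with the entry added to the Alfresco mimetype-map.xml.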