How to use the following technologies for ECM: a comparison
I have a theoretical question. I have tons of documents in various formats (ODS, MS Office, PDF, HTML), and I'd like to implement an ECM system that is not a document management system but rather a system that persists the metadata and content of documents (in a variety of languages) in a unified format (XHTML) to the filesystem and to a database (metadata only), and that does data processing (indexing, searching).
What technologies would you use, and how would you proceed? These are my options:
Using only Apache Tika: parse the documents, extract the metadata and content as XHTML, and then use Lucene or Solr for indexing and full-text search. The big disadvantage is database persistence, because the metadata varies a lot.
Using only Apache Solr with its Tika parsers (http://wiki.apache.org/solr/UpdateRichDocuments): I don't have experience with it. Does it support database integration the way Apache Nutch does?
Then there is the Apache UIMA project, but it is very hard to find out what is going on under the hood.
Using a CMS that already uses Apache Tika (Alfresco, Apache Jackrabbit): I don't have much experience with them, but I'm sure they have already taken care of problems that Apache Tika itself doesn't handle, such as doc vs. docx or differing metadata types.
I could also use a native XML database such as eXist-db once I have the XHTML from Apache Tika, but I'm not sure it is a good choice because the structure of these documents is rather flat. XML databases are meant for persisting more hierarchical documents.
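The database persistence concern in the first option (metadata that varies a lot from document to document) can be illustrated with a key/value layout, where each metadata field becomes its own row instead of its own column. What follows is only a minimal sketch under stated assumptions, not a recommendation from the question itself: it uses Python's built-in sqlite3 module as a stand-in for whatever database is chosen, and the table names, the function names (`open_store`, `save_document`, `find_by_metadata`), the file paths, and the metadata keys are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per document, one row per metadata field.
# New metadata keys need no schema change, which suits varying metadata.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    id INTEGER PRIMARY KEY,
    xhtml_path TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS metadata (
    document_id INTEGER NOT NULL REFERENCES document(id),
    key TEXT NOT NULL,
    value TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS metadata_key_value ON metadata (key, value);
"""


def open_store(db_path=":memory:"):
    """Open (or create) the metadata store; defaults to an in-memory database."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def save_document(conn, xhtml_path, metadata):
    """Record where the XHTML file lives and store each metadata field as a row."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO document (xhtml_path) VALUES (?)", (xhtml_path,)
        )
        doc_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO metadata (document_id, key, value) VALUES (?, ?, ?)",
            [(doc_id, key, str(value)) for key, value in metadata.items()],
        )
    return doc_id


def find_by_metadata(conn, key, value):
    """Return the XHTML paths of all documents that carry the given key/value."""
    rows = conn.execute(
        "SELECT DISTINCT d.xhtml_path FROM document d "
        "JOIN metadata m ON m.document_id = d.id "
        "WHERE m.key = ? AND m.value = ? ORDER BY d.xhtml_path",
        (key, value),
    )
    return [row[0] for row in rows]
```

With this layout, a PDF and a spreadsheet can be stored side by side even though they carry different fields, for example `save_document(conn, "store/report.xhtml", {"Author": "Alice"})` followed by `find_by_metadata(conn, "Author", "Alice")`. The trade-off is that every value is stored as text, so typed queries (dates, numbers) would need extra handling.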
Comments (1)
If you need an "out of the box" solution, you could consider an integration framework like Camel: set up a Camel route that extracts entities from the files (using Tika) and migrates them to your database through JDBC. Otherwise, this sounds like a typical data mining task that starts with raw source data and ends with extracted entities.
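The middle step of such a route, turning extracted output into rows a database can take, might look roughly like the sketch below. It is not Camel code and does not call Tika. It assumes the extraction step has already produced an XHTML string whose `<head>` carries `<meta name="..." content="..."/>` elements, which is the shape Tika's XHTML output commonly takes. The function names (`split_xhtml`, `to_rows`) and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace prefix ElementTree attaches to every XHTML element name.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_xhtml(xhtml_text):
    """Return (metadata dict, whitespace-normalized body text) from an XHTML string."""
    root = ET.fromstring(xhtml_text)
    metadata = {}
    for meta in root.iter(XHTML_NS + "meta"):
        name, content = meta.get("name"), meta.get("content")
        if name is not None and content is not None:
            metadata[name] = content
    body = root.find(XHTML_NS + "body")
    if body is None:
        return metadata, ""
    # Join all text nodes, then collapse runs of whitespace into single spaces.
    text = " ".join("".join(body.itertext()).split())
    return metadata, text


def to_rows(doc_id, xhtml_text):
    """Flatten the metadata into (doc_id, key, value) tuples, sorted by key."""
    metadata, _ = split_xhtml(xhtml_text)
    return [(doc_id, key, value) for key, value in sorted(metadata.items())]


# Hypothetical sample of what the extraction step is assumed to produce.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Type" content="application/pdf"/>
<meta name="Author" content="Alice"/>
<title>Report</title>
</head>
<body><p>Quarterly   results</p>
<p>Second paragraph</p></body>
</html>"""
```

The tuples returned by `to_rows(1, SAMPLE)` can be handed to any database driver's batch insert, which is the role JDBC plays in the route described above, while the body text would go to the indexer.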