Indexing/extracting files from archive formats with Solr CELL's ExtractingRequestHandler
Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc.) to extract the content for indexing?
I am sending Solr the archived.tar file using curl:
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"
When I query the document, the file names inside the archive are indexed in the "body_texts" field, but the contents of those files are not extracted or included. This is not the behavior I expected (ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example).
When I send one of the actual documents inside the archive using the same curl command, the extracted content is stored in the "body_texts" field. Am I missing a step for compressed files?
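Since sending a single document works, one possible client-side workaround is to unpack the archive first and post each member separately. The following is a minimal Python sketch of that idea, not how Tika handles archives internally. It reuses the URL and parameters from the curl command in the question; the function names and the "doc" id prefix are made up for illustration.

```python
import io
import tarfile
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl command in the question.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, content_field="body_texts", commit=True):
    """Build the same query string the curl command uses."""
    params = {
        "literal.id": doc_id,
        "fmap.content": content_field,
        "commit": "true" if commit else "false",
    }
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


def iter_member_requests(tar_bytes, id_prefix="doc"):
    """Yield (url, member_name, payload) for each regular file in a tar archive.

    The id_prefix is a hypothetical naming scheme: doc1, doc2, ...
    """
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        files = [member for member in archive.getmembers() if member.isfile()]
        for index, member in enumerate(files, start=1):
            payload = archive.extractfile(member).read()
            yield build_extract_url(f"{id_prefix}{index}"), member.name, payload


def post_member(url, payload):
    """POST one file body to Solr, mirroring curl's --data-binary upload."""
    request = Request(
        url,
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urlopen(request) as response:
        return response.status
```

With a running Solr instance, each tuple from iter_member_requests could be passed to post_member, so every file in the archive becomes its own document with its content in "body_texts".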
I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF, and HTML documents.
I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, and Tika Core 0.4.
Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.
Comments (1)
The short answer: use Solr Cell 1.4.1 with Tika Core 0.6.
The long answer: after a lot of headaches I was able to get this working. I'll answer it both for people using Solr directly and for people using Solr with the Ruby library Sunspot (which was my situation).
Here is what I did: I used the https://github.com/tomasc/sunspot_cell plugin to extend Sunspot and give it the attachment feature. (Ignore this step if you're not using Ruby/Sunspot.)
v1.4.1 works for individual files but not for compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed dist/apache-solr-cell-1.4.1.jar. Then I had to pull down the Tika libraries from the 1.5 branch: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/
You can download each jar individually, or you can use svn to check out the whole branch or just the library folder.
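Once the jars are in place, it may help to confirm which Tika versions the Solr lib folder actually holds. The sketch below is one way to do that in Python. It assumes the jars follow a tika-<module>-<version>.jar naming pattern (for example a hypothetical tika-core-0.6.jar); the function names and the directory argument are illustrative and not part of Solr or Tika.

```python
import re
from pathlib import Path

# Assumed naming pattern: tika-<module>-<version>.jar (e.g. tika-core-0.6.jar).
TIKA_JAR = re.compile(r"^(tika-[a-z]+)-(\d+(?:\.\d+)*)\.jar$")


def parse_tika_jars(file_names):
    """Map each Tika module name to the version found in its jar file name."""
    versions = {}
    for name in file_names:
        match = TIKA_JAR.match(name)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def tika_versions_in(lib_dir):
    """Scan a directory (hypothetical path) for Tika jars and report versions."""
    return parse_tika_jars(path.name for path in Path(lib_dir).glob("*.jar"))
```

For the setup described in the answer, the expectation would be that tika-core reports 0.6 rather than the 0.4 from the question.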