Solr ExtractingRequestHandler returns empty content for PDF documents
I am using ExtractingRequestHandler in Solr to extract document content and index it. It works fine for all Microsoft documents, but for PDFs the extracted content is empty. I have also tried extractOnly=true with curl, and that also returns just an empty body.
I have run Tika independently on the same documents, and it extracts the content just fine. The difference is that when running it standalone I use the BodyContentHandler that comes with Tika instead of the SolrContentHandler that Solr uses. Has anybody seen this?
I would really rather let Solr handle it than use Tika to extract the content outside of Solr.
Comments (1)
I dealt with this problem for hours before figuring it out: I was opening my PDFs in a non-binary (text) mode, so they were being fed to Solr only up to the first EOF character in the file. Solr will still extract the metadata from the file (as it appears in the header of the PDF), but it will return an empty body tag in its response.
This may not apply to the original poster, but it may save someone else from wasting hours of their life.
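For anyone who wants to rule this out, here is a minimal Python sketch of the idea: read the PDF in binary mode, sanity-check that the whole file arrived, and only then build the request for Solr. The host, port, and core name (`mycore`) are assumptions for illustration, as are the helper names; `/update/extract` is the path the ExtractingRequestHandler is usually mapped to, but check your own solrconfig.xml. The `looks_complete` check is only a rough heuristic, not a PDF validator.

```python
import urllib.parse
import urllib.request

# Assumed URL for illustration; host, port, and core name are hypothetical.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"


def read_pdf_bytes(path):
    """Read the whole file in binary mode ("rb") so that no byte is
    translated or treated as an end-of-file marker."""
    with open(path, "rb") as handle:
        return handle.read()


def looks_complete(pdf_bytes):
    """Rough heuristic: a complete PDF starts with %PDF- and has an
    %%EOF marker somewhere in its last kilobyte."""
    return pdf_bytes.startswith(b"%PDF-") and b"%%EOF" in pdf_bytes[-1024:]


def build_extract_request(pdf_bytes, base_url=SOLR_EXTRACT_URL):
    """Build (but do not send) a POST request carrying the raw PDF bytes,
    with extractOnly=true so Solr returns the extraction without indexing."""
    params = urllib.parse.urlencode({"extractOnly": "true", "wt": "json"})
    return urllib.request.Request(
        base_url + "?" + params,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def extract_only(path):
    """Send the PDF to Solr and return the raw response text.
    Requires a running Solr instance at the assumed URL."""
    data = read_pdf_bytes(path)
    if not looks_complete(data):
        raise ValueError("PDF looks truncated; was it read in text mode?")
    with urllib.request.urlopen(build_extract_request(data)) as response:
        return response.read().decode("utf-8")
```

If `looks_complete` fails on bytes that came from your own upload code but passes on the file on disk, the truncation is happening before Solr ever sees the document. In that case the metadata-but-empty-body symptom described above is the expected result.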