Solr's TikaEntityProcessor not working

Posted on 2024-09-03 10:17:04

I'm trying to get Solr to index a database in which one column is a filename of a PDF document I'd like to index. My configuration looks like this:

<dataConfig>
 <dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/>
 <dataSource name="ds-file" type="BinFileDataSource"/>
 <document name="documents">
   <entity name="document" dataSource="ds-db" query="select * from documents">
     <entity processor="TikaEntityProcessor" url="/some/path/${document.filename}" dataSource="ds-file" format="text">
       <field column="text" />
     </entity>
   </entity>
 </document>
</dataConfig>

I'm using Solr from trunk (as of last week). The import process completes without errors, and it picks up the columns from the database, but not the content from the PDF file. It is definitely trying to access the PDF file, for if I give it an incorrect path name, it complains. It doesn't seem to be attempting to index the PDF, though, as it completes in about 40ms, whereas if I import the PDF via the ExtractingRequestHandler, it takes about 11 seconds to index it.

I've also tried the tika example in example-DIH and that doesn't seem to index anything, either. Am I doing something wrong, or is this just not working yet?

I'm running Java 1.6.0_20 on OSX 10.6.3.

(I should note that I already posted this on the solr-user mailing list and didn't get an answer.)

Comments (1)

帅冕 2024-09-10 10:17:04

Someone on the solr-user mailing list had the answer: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p867572.html

Basically, there's a bug in Apache Tika that was introduced after version 0.6, and it is apparently still present in the 0.8 snapshot that is currently in Solr's trunk. Downloading Tika 0.6 (from http://archive.apache.org/dist/lucene/tika/) and copying tika-core-0.6.jar and tika-parsers-0.6.jar into the path fixed the issue.
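
The jar swap described above could be scripted roughly as follows. This is only a sketch: the answer does not say which directory "the path" refers to, so the function name `swap_tika_jars` and all three directory arguments are hypothetical and need to be adapted to the actual Solr install. It moves any other tika-core and tika-parsers versions out of the way first, on the assumption that having two Tika versions in the same directory is undesirable.

```shell
#!/bin/sh
# Sketch only: swap the Tika jars in a library directory for the 0.6 release.
# All three arguments are assumptions; the answer does not name the directories.
swap_tika_jars() {
    src_dir=$1      # hypothetical: where tika-core-0.6.jar and tika-parsers-0.6.jar were saved
    lib_dir=$2      # hypothetical: the directory Solr loads the Tika jars from
    backup_dir=$3   # hypothetical: where the replaced jars are moved

    # Refuse to change anything unless both 0.6 jars are available.
    for jar in tika-core-0.6.jar tika-parsers-0.6.jar; do
        if [ ! -f "$src_dir/$jar" ]; then
            echo "missing $src_dir/$jar" >&2
            return 1
        fi
    done

    # Move other Tika versions aside rather than deleting them.
    mkdir -p "$backup_dir" || return 1
    for old in "$lib_dir"/tika-core-*.jar "$lib_dir"/tika-parsers-*.jar; do
        [ -f "$old" ] || continue   # the glob matched nothing
        case "$old" in
            */tika-core-0.6.jar|*/tika-parsers-0.6.jar) continue ;;
        esac
        mv "$old" "$backup_dir/" || return 1
    done

    # Copy the 0.6 jars into place.
    cp "$src_dir/tika-core-0.6.jar" "$src_dir/tika-parsers-0.6.jar" "$lib_dir/"
}
```

A call might look like `swap_tika_jars ~/Downloads /path/to/solr/lib /tmp/tika-backup`, where every path is a placeholder. Solr would presumably need a restart afterwards so that it loads the replaced jars.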
