Indexing PDF documents in Solr without a UniqueKey

Posted on 2024-11-24 09:31:34 | Word count: 2124 | Views: 3 | Comments: 0

I want to index PDF (and other rich) documents. I am using the DataImportHandler.

Here is how my schema.xml looks:

.........
.........
<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>
........
........
<uniqueKey>link</uniqueKey>

As you can see, I have set link as the unique key so that documents are not duplicated when indexing runs again. The file paths are stored in a database, and I have set up the DataImportHandler to fetch the list of file paths and index each document. To test it, I used the tutorial.pdf file that ships with the Solr example docs. The problem, of course, is that this PDF document has no 'link' field. I am looking for a way to manually set the file path as link when indexing these documents. I tried the following data-config settings:

 <entity name="fileItems"  rootEntity="false" dataSource="dbSource" query="select path from file_paths">
   <entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
     <field column="title" name="title" meta="true"/>
     <field column="Creation-Date" name="date_published" meta="true"/>
     <entity name="filePath" dataSource="dbSource" query="SELECT path FROM file_paths as link where path = '${fileItems.path}'">
       <field column="link" name="link"/>
     </entity>
   </entity>
  </entity>

Here I create a sub-entity that queries for the path and is meant to return it in a column named 'link'. But I still see this error:

WARNING: Error creating document : SolrInputDocument[{date_published=date_published(1.0)={2011-06-23T12:47:45Z}, title=title(1.0)={Solr tutorial}}]
org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: link

Is there any way for me to create a field called link for the PDF documents?

This was already asked here before, but the solution provided there uses the ExtractingRequestHandler, and I want to do this through the DataImportHandler.


Comments (1)

手心的海 2024-12-01 09:31:34

Try this:

<entity name="fileItems"  rootEntity="false" dataSource="dbSource" query="select path from file_paths">
  <field column="path" name="link"/>
  <entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
  </entity>
</entity>
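
For context, here is a rough sketch of how the snippet above might sit inside a complete data-config.xml. The two dataSource definitions are assumptions, since the question does not show them: the JDBC driver, URL, user and password are hypothetical placeholders, and BinFileDataSource is assumed as the type behind fileSource. The entity section is the same as above.

```xml
<dataConfig>
  <!-- Hypothetical JDBC settings (not from the question): replace driver, url, user and password with your own. -->
  <dataSource name="dbSource" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/filesdb"
              user="solr" password="secret"/>
  <!-- Assumed type for fileSource: BinFileDataSource hands the raw file bytes to Tika. -->
  <dataSource name="fileSource" type="BinFileDataSource"/>
  <document>
    <entity name="fileItems" rootEntity="false" dataSource="dbSource"
            query="select path from file_paths">
      <!-- Map the database column "path" directly onto the uniqueKey field "link". -->
      <field column="path" name="link"/>
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${fileItems.path}" dataSource="fileSource">
        <field column="title" name="title" meta="true"/>
        <field column="Creation-Date" name="date_published" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The point of the change is that the path is already available in the outer fileItems row, so it can be mapped to link there without a second query. As a side note, the sub-entity query in the question, `SELECT path FROM file_paths as link`, aliases the table rather than the column, so the result still has a column named path and nothing named link to map; a column alias would be written `SELECT path AS link FROM file_paths`.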