Error extracting PDF metadata with Solr
I am using Solr 3.3 and I am trying to extract and index metadata from PDF files. I am using the DataImportHandler with the TikaEntityProcessor to add the documents. Here are the fields as defined in my schema.xml file:
<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
<field name="imgName" type="string" indexed="false" stored="true" multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>
So I suppose the metadata should be indexed and stored in fields prefixed with "attr_".
Here is how my data config file looks. It takes a source directory path from a database and passes it to a FileListEntityProcessor, which passes each PDF file found in that directory to the TikaEntityProcessor to extract and index the content.
<entity onError="skip" name="fileSourcePaths" rootEntity="false" dataSource="dbSource" fileName=".*pdf" query="select path from file_sources">
  <entity name="fileSource" processor="FileListEntityProcessor" transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}" recursive="true" rootEntity="false">
    <field name="link" column="fileAbsolutePath" thumbnail="true"/>
    <field name="imgName" column="imgName"/>
    <entity rootEntity="true" onError="abort" name="file" processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}" dataSource="fileSource" format="text">
      <field column="resourceName" name="title" meta="true"/>
      <field column="Creation-Date" name="date_published" meta="true"/>
      <field column="text" name="description"/>
    </entity>
  </entity>
</entity>
It extracts the description and Creation-Date just fine, but it does not seem to extract resourceName, so the documents have no title field when I query the index. This is odd because both Creation-Date and resourceName are metadata. Also, none of the other available metadata is being stored under the attr_ fields. I came across some threads that mention known problems with Tika 0.8, so I downloaded Tika 0.9 and used it in place of 0.8. I also upgraded pdfbox, jempbox and fontbox from 1.3 to 1.4.
I tested one of the PDFs separately with just Tika to see what metadata is stored with the file. This is what I found:
Content-Length: 546459
Content-Type: application/pdf
Creation-Date: 2010-06-09T12:11:12Z
Last-Modified: 2010-06-09T14:53:38Z
created: Wed Jun 09 08:11:12 EDT 2010
creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows
producer: Antenna House PDF Output Library 2.6.0 (Windows)
resourceName: Argentina.pdf
trapped: False
xmpTPg:NPages: 2
As you can see, it does have resourceName metadata. I tried indexing again but got the same result: Creation-Date is extracted and indexed just fine, but resourceName is not. The rest of the attributes are also not being indexed under the attr_ fields.
What's going wrong?
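One direction that may be worth trying is sketched below. It rests on two unverified assumptions. First, that resourceName is only populated when Tika is given a file name (as the standalone Tika run is), so it may simply be absent when the DataImportHandler hands Tika a stream. Second, that the TikaEntityProcessor only indexes metadata that has an explicit `<field ... meta="true"/>` mapping, and that the automatic "attr_" prefixing is a feature of the ExtractingRequestHandler (its `uprefix` parameter) rather than of the DataImportHandler. Under those assumptions, the title could be taken from the `file` column that FileListEntityProcessor emits, and each wanted metadata key could be mapped by hand. The target names `attr_creator`, `attr_producer` and `attr_last_modified` are made up for illustration; they only need to match the `attr_*` dynamic field.

```xml
<entity name="fileSource" processor="FileListEntityProcessor" transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}" recursive="true" rootEntity="false">
  <field name="link" column="fileAbsolutePath" thumbnail="true"/>
  <field name="imgName" column="imgName"/>
  <!-- Assumption: use the file name from FileListEntityProcessor's "file" column as the title -->
  <field column="file" name="title"/>
  <entity rootEntity="true" onError="abort" name="file" processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}" dataSource="fileSource" format="text">
    <field column="Creation-Date" name="date_published" meta="true"/>
    <field column="text" name="description"/>
    <!-- Assumption: metadata is only indexed when mapped explicitly; target names are illustrative -->
    <field column="creator" name="attr_creator" meta="true"/>
    <field column="producer" name="attr_producer" meta="true"/>
    <field column="Last-Modified" name="attr_last_modified" meta="true"/>
  </entity>
</entity>
```

The metadata keys on the `column` side are copied from the standalone Tika output above. This fragment replaces only the two inner entities and still belongs inside the outer `fileSourcePaths` entity.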