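Indexing PDFs with Solr
My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tell me what I need to do to index PDFs.
I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler
But it makes very little sense to me. Do I need to install Tika?
I'm lost, please help.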

Apache Solr can now index all sorts of binary files, such as PDF and Word documents. Check out this doc:
https://lucene.apache.org/solr/guide/8_5/uploading-data-with-solr-cell-using-apache-tika.html
With solr-4.9 (the latest version as of now), extracting data from rich documents such as PDFs, spreadsheets (xls, xlsx family), presentations (ppt, pptx), and documents (doc, txt, etc.) has become fairly simple.
The sample code examples provided in the downloaded Solr archive contain a basic Solr template project to get you started quickly.
The necessary steps are as follows:
1. Change solrconfig.xml to include the following lines:
<lib dir="<path_to_extraction_libs>" regex=".*\.jar" />
<lib dir="<path_to_solr_cell_jar>" regex="solr-cell-\d.*\.jar" />
and create a request handler as follows:
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults" />
</requestHandler>
2. Add the necessary jars from the solrExample to your project.
3. Define the schema as per your needs and fire a query like:
curl "http://localhost:8983/solr/collection1/update/extract?literal.id=1&literal.filename=testDocToExtractFrom.txt&literal.created_at=2014-07-22+09:50:12.234&commit=true" -F "myfile=@<path_to_file>"
Then go to the Solr admin GUI and run a query to see the indexed contents.
Let me know if you face any problems.
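For anyone who would rather script the upload than call curl, here is a minimal Python sketch of the same request. It assumes the /update/extract handler configured above and sends the raw PDF bytes as the request body with a Content-Type header, instead of the multipart form that curl -F uses. The helper names build_extract_url and index_pdf are made up for this illustration, and the core name collection1 is taken from the curl example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id, literals=None, commit=True):
    """Build the /update/extract URL with literal.* fields as query parameters."""
    params = {"literal.id": doc_id}
    for name, value in (literals or {}).items():
        params["literal." + name] = value
    if commit:
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def index_pdf(base_url, doc_id, pdf_path, literals=None):
    """POST the raw PDF bytes to Solr; Tika parses the file on the server side."""
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = Request(
        build_extract_url(base_url, doc_id, literals),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    with urlopen(request) as response:
        return response.status
```

With a running Solr instance, a call such as index_pdf("http://localhost:8983/solr/collection1", "1", "report.pdf", {"filename": "report.pdf"}) should behave like the curl command; report.pdf is a placeholder file name.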
You could use the DataImportHandler. The DataImportHandler is defined in solrconfig.xml, while its configuration should be placed in a separate XML config file (data-config.xml).
For indexing PDFs you could:
1.) crawl the directory to find all the PDFs using the FileListEntityProcessor
2.) read the PDFs from a "content/index" XML file using the XPathEntityProcessor
If you have the list of related PDFs, use the TikaEntityProcessor.
Look at this http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ (example with ppt) and this: Solr : data import handler and solr cell
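To make the crawl step concrete, here is a small client-side Python sketch of what "find all the PDFs under a directory" amounts to. It is not the DataImportHandler itself, which does this inside Solr through the FileListEntityProcessor; it only illustrates the same idea outside Solr, and the function name find_pdfs is invented for this example.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every *.pdf file under root (case-insensitive), sorted for stable ordering."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each path returned could then be handed to whatever does the actual indexing, for example a POST to the extracting request handler.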
The hardest part of this is getting the metadata out of the PDFs; using a tool like Aperture simplifies this. There must be tons of these tools.
Aperture grabbed the metadata from the PDFs and stored it in XML files.
I parsed the XML files using lxml and posted them to Solr.
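A rough Python sketch of that parse-and-post step might look like the following. It is not the code I actually ran: it uses the standard library's xml.etree.ElementTree instead of lxml so that it has no dependencies, and it assumes a hypothetical, simplified layout of <field name="...">value</field> elements, which is not Aperture's real output format. The function names and the JSON update URL passed to post_docs are likewise assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen


def metadata_to_doc(xml_text, doc_id):
    """Turn <field name="...">value</field> elements into a flat Solr document."""
    doc = {"id": doc_id}
    for field in ET.fromstring(xml_text).iter("field"):
        name = field.get("name")
        if name and field.text:
            doc[name] = field.text.strip()
    return doc


def post_docs(update_url, docs):
    """Send a list of documents to Solr's update handler as JSON and commit."""
    request = Request(
        update_url + "?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return response.status
```

For example, post_docs("http://localhost:8983/solr/collection1/update", [metadata_to_doc(xml_text, "doc-1")]) would send one document, provided the field names exist in your schema.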
Use Solr's ExtractingRequestHandler. This uses Apache Tika to parse the PDF file. I believe that it can pull out the metadata etc. You can also pass through your own metadata.
Extracting Request Handler
This may help.