Solr/Lucene experts, please help me design a simple keyword search over indexed PDF files

Posted 2024-11-27 16:35:58

I dabbled with Solr but couldn't figure out a way to tailor it to my requirement.

What I have:

A bunch of PDF files.
A set of keywords.

What I am trying to achieve:

Index the PDF files (Solr Cell, done).
Search for a keyword (works OK).
Tailor the output to return the names of the matching PDF files, plus an excerpt showing where the keyword occurred (no idea how to do this).

I tried modifying the response handler, schema.xml, and solrconfig.xml, to no avail.

Lucene/Solr experts, do you think what I am trying to achieve is possible?

I put my existing code on GitHub at https://github.com/ThinkCode/solr_search. It is mostly Solr's default example with minor modifications to the fields (all the content is stored in one content field).

The notable changes in schema.xml are:

<solrQueryParser defaultOperator="AND"/>

   <field name="id" type="string" indexed="true" stored="true" required="true" />

   <field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

   <dynamicField name="*" type="string"    indexed="true"  stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

<solrQueryParser defaultOperator="AND"/>

<copyField source="*" dest="content"/>

Current output:

(query)
http://localhost:8983/solr/select/?q=Java+Servlet&version=2.2&start=0&rows=10&indent=on

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">13</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Java Servlet</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="content_type"><str>application/pdf</str></arr>
      <str name="id">tutorial.pdf</str>
      <str name="subject">Solr</str>
      <arr name="title"><str>Solr tutorial</str></arr>
    </doc>
  </result>
</response>
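For what it's worth, the PDF name is already present in this response as the `id` field, so that half of the goal only needs a bit of client-side parsing. A minimal sketch using Python's standard library follows; the `pdf_names` helper is a made-up name for illustration, and the XML literal is the `<result>` part of the response above with the `responseHeader` left out.

```python
import xml.etree.ElementTree as ET

# The <result> part of the response shown above, pasted as a literal.
# The responseHeader is omitted here only to keep the example short.
RESPONSE = """<response>
<result name="response" numFound="1" start="0"><doc>
<arr name="content_type"><str>application/pdf</str></arr>
<str name="id">tutorial.pdf</str>
<str name="subject">Solr</str>
<arr name="title"><str>Solr tutorial</str></arr>
</doc></result>
</response>"""


def pdf_names(xml_text):
    """Return the value of the 'id' field of every <doc> in a Solr XML response."""
    root = ET.fromstring(xml_text)
    return [s.text for s in root.findall("./result/doc/str[@name='id']")]


print(pdf_names(RESPONSE))
```

This relies on `id` holding the file name, as it does in the output above (`tutorial.pdf`).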

What I am looking for is the extracted fragment (line) where the keyword was found.

In the query above, I searched for 'Java Servlet' and it returned the document. I would like the surrounding context, 'Solr can run in any Java Servlet Container of your choice', to be returned in the output XML as well.
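One direction that may fit this requirement is Solr's built-in highlighting, which is switched on per request with the `hl`, `hl.fl`, `hl.snippets` and `hl.fragsize` parameters and returns fragments in a separate `highlighting` section keyed by document `id`. The sketch below is only an illustration under assumptions: it assumes the highlighting component is active on the `/select` handler (as in the stock example config) and that `content` stays `stored="true"`. The helper names `highlight_url` and `excerpts` are invented here, and `SAMPLE` is a hand-written response fragment showing the expected shape, not output captured from a real server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select/"  # endpoint from the query above


def highlight_url(keywords, field="content", snippets=3, fragsize=100):
    """Build a select URL that asks Solr for ids plus highlighted fragments."""
    params = {
        "q": keywords,
        "fl": "id",               # return only the file name for each hit
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to take fragments from (must be stored)
        "hl.snippets": snippets,  # maximum fragments per document
        "hl.fragsize": fragsize,  # approximate fragment length in characters
        "indent": "on",
    }
    return SOLR_SELECT + "?" + urlencode(params)


def excerpts(xml_text):
    """Map each document id to the list of highlighted fragments returned for it."""
    root = ET.fromstring(xml_text)
    found = {}
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        found[doc.get("name")] = [s.text for s in doc.iter("str")]
    return found


# Hypothetical response fragment, written by hand to show the expected shape.
SAMPLE = """<response>
<lst name="highlighting">
  <lst name="tutorial.pdf">
    <arr name="content">
      <str>Solr can run in any &lt;em&gt;Java&lt;/em&gt; &lt;em&gt;Servlet&lt;/em&gt; Container of your choice</str>
    </arr>
  </lst>
</lst>
</response>"""

print(highlight_url("Java Servlet"))
print(excerpts(SAMPLE))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the body to `excerpts` would then give file names together with their fragments. Whether the fragments line up with whole lines of the PDF depends on the extracted text and the fragment size, so the values above are starting points rather than recommendations.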
