Retrieving extracted text with Apache Solr
I'm new to Apache Solr, and I want to use it to index PDF files. So far I have it up and running, and I can search for the PDF files I've added.
However, I also need to be able to retrieve the extracted text from the search results.
I found an XML snippet in the default solrconfig.xml that concerns exactly this:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
From what I understand from this article (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika), I think I have to add a new field to schema.xml (e.g. "content") that has stored="true" and indexed="true". However, I'm not sure exactly how to do that.
Any help is appreciated, thanks.
Comments (1)
Add a field for the extracted content to your schema.xml and mark it as stored.
If a field is stored, it will show up in the results by default.
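As a minimal sketch of what such a field definition might look like (this is not the snippet from the original answer): the field name "content" is taken from the question, while the type "text_general" is an assumption and must match a fieldType that actually exists in your schema (older example schemas call it "text"). The second fragment assumes you also point the extract handler's fmap.content at the new field, so that the extracted body lands in a stored field rather than in "text".

```xml
<!-- schema.xml, inside the <fields> section.
     "content" is a hypothetical field name; "text_general" is assumed
     to be a fieldType defined elsewhere in this schema. -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- solrconfig.xml, inside the "defaults" list of the /update/extract handler.
     Maps the extracted body to the stored "content" field. -->
<str name="fmap.content">content</str>
```

After changing the schema, Solr has to be restarted (or the core reloaded) and the PDF files re-indexed, since documents indexed earlier do not carry the stored value. The field can then be requested explicitly with the fl query parameter, for example fl=id,content, or used for highlighting with hl=true and hl.fl=content.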