How do I index PDF content with SolrJ?

Posted 2024-11-01 13:41:55

I'm trying to index a few PDF documents using SolrJ, as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. Here is the code:

import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;

import java.io.File;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
...
public static void indexFilesSolrCell(String fileName) throws IOException, SolrServerException {

  String urlString = "http://localhost:8080/solr"; 
  SolrServer server = new CommonsHttpSolrServer(urlString);

  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.addFile(new File(fileName));
  String id = fileName.substring(fileName.lastIndexOf('/')+1);
  System.out.println(id);

  up.setParam(LITERALS_PREFIX + "id", id);
  up.setParam(LITERALS_PREFIX + "location", fileName); // this field doesn't exist in schema.xml, so it'll be created as attr_location
  up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
  up.setParam(MAP_PREFIX + "content", "attr_content");
  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

  NamedList<Object> response = server.request(up); // server.request() returns the response
  for (Entry<String, Object> entry : response) {
    System.out.println(entry.getKey());
    System.out.println(entry.getValue());
  }
}

Unfortunately, when querying for *:* I get the list of indexed documents, but the content field is empty. How can I change the code above so that it also extracts the document's content?

Below is the XML fragment that describes this document:

<doc>
  <arr name="attr_content">
    <str>            </str>
  </arr>
  <arr name="attr_location">
    <str>/home/alex/Documents/lsp.pdf</str>
  </arr>
  <arr name="attr_meta">
    <str>stream_size</str>
    <str>31203</str>
    <str>Content-Type</str>
    <str>application/pdf</str>
  </arr>
  <arr name="attr_stream_size">
    <str>31203</str>
  </arr>
  <arr name="content_type">
    <str>application/pdf</str>
  </arr>
  <str name="id">lsp.pdf</str>
</doc>
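
The symptom in the fragment above (an attr_content value that holds only whitespace) can be detected programmatically. What follows is a minimal sketch, using only the JDK's XML parser, that lists the multi-valued fields of a result doc whose values are all blank. The class name BlankFieldCheck and the method blankArrayFields are hypothetical helpers, not part of SolrJ, and the sketch assumes the doc is available as an XML string in the format shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of SolrJ): spots multi-valued fields that came back blank.
class BlankFieldCheck {

    // Returns the names of <arr> fields whose <str> values are all whitespace
    // (an <arr> with no <str> children is also reported).
    static List<String> blankArrayFields(String docXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(docXml.getBytes(StandardCharsets.UTF_8)));
            List<String> blank = new ArrayList<String>();
            NodeList arrays = dom.getElementsByTagName("arr");
            for (int i = 0; i < arrays.getLength(); i++) {
                Element arr = (Element) arrays.item(i);
                NodeList values = arr.getElementsByTagName("str");
                boolean allBlank = true;
                for (int j = 0; j < values.getLength(); j++) {
                    if (!values.item(j).getTextContent().trim().isEmpty()) {
                        allBlank = false;
                    }
                }
                if (allBlank) {
                    blank.add(arr.getAttribute("name"));
                }
            }
            return blank;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the doc XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the fragment from the question.
        String doc = "<doc>"
                + "<arr name=\"attr_content\"><str>            </str></arr>"
                + "<arr name=\"content_type\"><str>application/pdf</str></arr>"
                + "<str name=\"id\">lsp.pdf</str>"
                + "</doc>";
        System.out.println(blankArrayFields(doc)); // prints [attr_content]
    }
}
```

A check like this only confirms the symptom; it does not say whether the text was lost during extraction or when the field was stored.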

I don't think this problem is caused by an incorrect installation of Apache Tika: I previously got a few ServerExceptions, but I have since installed the required jars in the correct path. Moreover, I tried indexing a txt file with the same class, and the attr_content field is still empty.
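
One way to narrow this down is to ask the extract handler to return the extracted text without indexing it. To my understanding of the Solr Cell documentation, the handler accepts extractOnly=true (and extractFormat=text) for this purpose; treat both parameter names as assumptions to verify against your Solr version. The sketch below only builds such a request URL with JDK classes. ExtractOnlyUrl and build are hypothetical names, and the base URL is the one used in the question. In the SolrJ code above, the equivalent would presumably be up.setParam("extractOnly", "true").

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of SolrJ): builds a URL for testing extraction by hand.
class ExtractOnlyUrl {

    // Appends /update/extract and the URL-encoded parameters, in insertion order.
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl).append("/update/extract");
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always supported by the JDK
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("extractOnly", "true");   // assumption: return extracted content, do not index
        params.put("extractFormat", "text"); // assumption: plain text instead of XHTML
        System.out.println(build("http://localhost:8080/solr", params));
    }
}
```

Posting the PDF to the resulting URL (for example with an HTTP client of your choice) should show whether Tika produces any text at all. If the extracted text is already blank there, the schema is unlikely to be the cause.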

Comments (1)

戒ㄋ 2024-11-08 13:41:55

In your schema.xml file, have you set stored="true" on the content field? Here is an example from my schema.xml, which I use to store the content of PDFs and other binary files:

<field name="text" type="textgen" indexed="true"
stored="true"
required="false" multiValued="true"/>

Did it help you?

Héctor
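
The check suggested in this comment can be automated. The following is a minimal sketch, assuming schema.xml has been read into a string; StoredFieldCheck and isStored are hypothetical names, not a Solr API. It only looks at an explicit stored attribute on <field> elements. It does not resolve <dynamicField> patterns (such as one matching attr_content) or defaults inherited from the field type.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Solr API): checks a schema for an explicitly stored field.
class StoredFieldCheck {

    // Returns true only if a <field> with this name exists and declares stored="true".
    static boolean isStored(String schemaXml, String fieldName) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = dom.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                if (fieldName.equals(field.getAttribute("name"))) {
                    return "true".equals(field.getAttribute("stored"));
                }
            }
            return false; // no explicit <field> declaration with that name
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the schema XML", e);
        }
    }

    public static void main(String[] args) {
        // The field definition from the comment above, wrapped in a minimal schema.
        String schema = "<schema><fields>"
                + "<field name=\"text\" type=\"textgen\" indexed=\"true\" "
                + "stored=\"true\" required=\"false\" multiValued=\"true\"/>"
                + "</fields></schema>";
        System.out.println(isStored(schema, "text")); // prints true
    }
}
```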
