What is the best approach to implementing search over documents (PDF, XML, HTML, MS Word)?

Posted 2024-07-18 22:03:56

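What is a good way to write search functionality for documents in a Java web application?

Is "tagged search" a good fit for this kind of search functionality?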

别理我 2024-07-25 22:03:56

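Why reinvent the wheel?

Take a look at Apache Lucene.

Also, search Stack Overflow for "full-text search" and you will find many very similar questions. Here is one example:
How to implement search functionality in a website?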

素年丶 2024-07-25 22:03:56

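You can use Solr, which sits on top of Lucene and is a full web search engine application, whereas Lucene is a library. However, neither Solr nor Lucene parses Word documents, PDFs, and so on to extract metadata. Documents have to be indexed according to a predefined document schema.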
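
To make "indexing" a little more concrete, here is a minimal sketch of an inverted index, the kind of data structure a full-text search library is built around. It is not Lucene's or Solr's API: the class `SimpleIndex` and its methods `add` and `search` are hypothetical names made up for illustration, and the sketch assumes the plain text has already been extracted from each document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not a Lucene class): term -> IDs of documents containing it. */
class SimpleIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes already-extracted plain text under a document ID such as a file name. */
    public void add(String docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the sorted IDs of documents that contain every term in the query. */
    public List<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its documents
            } else {
                result.retainAll(docs);         // later terms: keep only the overlap
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        SimpleIndex index = new SimpleIndex();
        index.add("report.pdf", "Quarterly sales report for the Java team");
        index.add("notes.html", "Meeting notes: Java search features");
        System.out.println(index.search("java"));         // prints [notes.html, report.pdf]
        System.out.println(index.search("java search"));  // prints [notes.html]
    }
}
```

A real library adds much more on top of this (relevance ranking, stemming, per-field schemas, on-disk storage), which is the argument for using one rather than writing your own.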

雅心素梦 2024-07-25 22:03:56

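As for extracting the text content of Office documents (which you need to do before handing it to Lucene), there is the Apache Tika project, which supports quite a few file formats, including Microsoft's.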

小嗷兮 2024-07-25 22:03:56

Using Tika, the code to get the text from a file is quite simple:

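import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Exception handling not shown; `file` is a java.io.File.
// Written against Tika 0.3; newer Tika releases use a parse(...) overload that also takes a ParseContext.
Parser parser = new AutoDetectParser();              // picks a parser based on the detected file type
StringWriter textBuffer = new StringWriter();        // collects the extracted body text
InputStream input = new FileInputStream(file);
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, file.getName());  // file name as a hint for type detection
parser.parse(input, new BodyContentHandler(textBuffer), md);
input.close();
String text = textBuffer.toString();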

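So far, Tika 0.3 seems to work well. Throw any file at it and it gives you back whatever makes the most sense for that format. I have been able to get indexable text from everything I have thrown at it so far, including PDFs and the new MS Office files. If some formats have problems, I believe they mainly concern formatted text extraction rather than raw plain text.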
