Apache Tika and the character limit when parsing documents

Posted on 2024-11-09 19:40:55

Could anybody please help me sort this out?

With the Tika facade, the maximum string length can be set like this:

   Tika tika = new Tika();
   tika.setMaxStringLength(10*1024*1024);

But if you don't use Tika directly, like this:

ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();

ParseContext ps = new ParseContext();
for (InputStream is : getInputStreams()) {
    parser.parse(is, textHandler, metadata, ps);
    is.close();
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
}

There seems to be no way to set the limit, because you never interact with WriteOutContentHandler directly. By the way, its write limit is set to -1 by default, which means no limit, yet the limit I actually get is 100000 characters.

/**
 * The maximum number of characters to write to the character stream.
 * Set to -1 for no limit.
 */
private final int writeLimit;

/**
 * Number of characters written so far.
 */
private int writeCount = 0;

private WriteOutContentHandler(Writer writer, int writeLimit) {
    this.writer = writer;
    this.writeLimit = writeLimit;
}

/**
 * Creates a content handler that writes character events to
 * the given writer.
 *
 * @param writer writer
 */
public WriteOutContentHandler(Writer writer) {
    this(writer, -1);
}
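
To illustrate how a write limit like the one in the excerpt above typically takes effect, here is a minimal sketch that uses only the JDK's SAX classes. `LimitedTextHandler` and `WriteLimitDemo` are hypothetical names made up for this illustration, not Tika classes, and the "keep what fits, then throw" behaviour is an assumption about how such a handler works rather than a copy of Tika's code.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited content handler (not a Tika class). */
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit;   // maximum characters to keep; -1 means no limit
    private int writeCount = 0;     // characters kept so far

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || writeCount + length <= writeLimit) {
            // The whole chunk fits under the limit (or there is no limit).
            text.append(ch, start, length);
            writeCount += length;
        } else {
            // Keep only the part that still fits, then signal that the limit was hit.
            int remaining = writeLimit - writeCount;
            text.append(ch, start, remaining);
            writeCount = writeLimit;
            throw new SAXException("Write limit of " + writeLimit + " characters reached");
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class WriteLimitDemo {
    /** Feeds the input to the handler in 4-character chunks and returns the text it kept. */
    static String capture(String input, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chars = input.toCharArray();
        try {
            for (int i = 0; i < chars.length; i += 4) {
                handler.characters(chars, i, Math.min(4, chars.length - i));
            }
        } catch (SAXException e) {
            // Limit reached: the handler still holds the text captured up to that point.
        }
        return handler.toString();
    }
}
```

In this sketch, calling `WriteLimitDemo.capture(text, -1)` keeps the whole input, while a positive limit truncates it. That is the kind of behaviour that would explain output stopping at a fixed character count when the handler in use was created with a limit other than -1.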

Comments (1)

一个人练习一个人 2024-11-16 19:40:55

You must have overlooked that BodyContentHandler has a constructor that takes a write limit:

ContentHandler textHandler = new BodyContentHandler(writeLimit); // int writeLimit; pass -1 to disable the limit