Apache Tika and the character limit when parsing documents
Could anybody please help me sort this out?
It can be done like this:
    Tika tika = new Tika();
    tika.setMaxStringLength(10*1024*1024);
But if you don't use the Tika class directly, like this:
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext ps = new ParseContext();
    for (InputStream is : getInputStreams()) {
        parser.parse(is, textHandler, metadata, ps);
        is.close();
        System.out.println("Title: " + metadata.get("title"));
        System.out.println("Author: " + metadata.get("Author"));
    }
There seems to be no way to set the limit, because you never interact with the WriteOutContentHandler directly. By the way, its writeLimit is set to -1 by default, which means no limit, yet the extracted text is still cut off at 100000 characters.
    /**
     * The maximum number of characters to write to the character stream.
     * Set to -1 for no limit.
     */
    private final int writeLimit;

    /**
     * Number of characters written so far.
     */
    private int writeCount = 0;

    private WriteOutContentHandler(Writer writer, int writeLimit) {
        this.writer = writer;
        this.writeLimit = writeLimit;
    }

    /**
     * Creates a content handler that writes character events to
     * the given writer.
     *
     * @param writer writer
     */
    public WriteOutContentHandler(Writer writer) {
        this(writer, -1);
    }
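To make the role of writeLimit and writeCount concrete, here is a simplified stand-in written with the standard library only. It is not Tika's implementation: the class WriteLimitSketch, the boolean return value of characters, and the feed helper are all hypothetical, and the sketch assumes the limit is enforced by truncating the chunk that crosses it.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;

// Hypothetical model of a write-limited handler; NOT the Tika source.
class WriteLimitSketch {
    private final Writer writer;
    private final int writeLimit;   // -1 means "no limit"
    private int writeCount = 0;     // characters written so far

    WriteLimitSketch(Writer writer, int writeLimit) {
        this.writer = writer;
        this.writeLimit = writeLimit;
    }

    // Writes a chunk of characters; returns false once the limit has been hit.
    boolean characters(char[] ch, int start, int length) throws IOException {
        if (writeLimit == -1 || writeCount + length <= writeLimit) {
            writer.write(ch, start, length);
            writeCount += length;
            return true;
        }
        // Only part of this chunk still fits: write that part and stop.
        writer.write(ch, start, writeLimit - writeCount);
        writeCount = writeLimit;
        return false;
    }

    // Feeds text through the handler in chunks and returns what was kept.
    static String feed(int writeLimit, String text, int chunkSize) {
        StringWriter out = new StringWriter();
        WriteLimitSketch handler = new WriteLimitSketch(out, writeLimit);
        char[] chars = text.toCharArray();
        try {
            for (int i = 0; i < chars.length; i += chunkSize) {
                int len = Math.min(chunkSize, chars.length - i);
                if (!handler.characters(chars, i, len)) {
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] data = new char[250_000];
        Arrays.fill(data, 'x');
        String text = new String(data);
        System.out.println(feed(-1, text, 4096).length());      // 250000
        System.out.println(feed(100_000, text, 4096).length()); // 100000
    }
}
```

With a limit of -1 everything passes through, so a cut-off at 100000 characters suggests that whatever handler is actually in use was created with a limit of 100000 rather than -1.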
Comments (1)
You must have overlooked that the content handler has a constructor that takes a write limit.
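A minimal sketch of what that could look like, assuming Tika (tika-core plus the parser modules) is on the classpath. BodyContentHandler(int writeLimit) is part of Tika's API; the class name UnlimitedBodyText and the helper extractText are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class UnlimitedBodyText {

    // Parses one file and returns its body text, bounded by writeLimit
    // characters; pass -1 to disable the limit.
    static String extractText(Path file, int writeLimit)
            throws IOException, SAXException, TikaException {
        // A fresh handler per document, so text does not accumulate across files.
        BodyContentHandler handler = new BodyContentHandler(writeLimit);
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        try (InputStream is = Files.newInputStream(file)) {
            parser.parse(is, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the document to parse.
        String text = extractText(Paths.get(args[0]), -1);
        System.out.println("Extracted characters: " + text.length());
    }
}
```

Passing -1 removes the limit entirely, so very large documents are held in memory in full. With a positive limit such as 10*1024*1024, the Javadoc says a SAXException is thrown once the limit is reached, so the caller should be prepared to catch it.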