Apache Tika: Parsing a text file leaves out the last part?

Posted on 2024-11-19 00:20:05

I am trying to parse a plain text file using Tika but am getting inconsistent
behavior.

More specifically, I have defined a simple handler as follows:

public class MyHandler extends DefaultHandler
{
     @Override
     public void characters(char ch[], int start, int length) throws SAXException
     {
        System.out.println(new String(ch));
     }
}

Then, I parse the file ("myfile.txt") as follows:

Tika tika = new Tika();
InputStream is = new FileInputStream("myfile.txt");

Metadata metadata = new Metadata();
ContentHandler handler = new MyHandler();

Parser parser = new TXTParser();
ParseContext context = new ParseContext();

String mimeType = tika.detect(is);
metadata.set(HttpHeaders.CONTENT_TYPE, mimeType);

parser.parse(is, handler, metadata, context);

I would expect all the text in the file to be printed on screen, but a
small part at the end is not. More specifically, the characters() callback
keeps delivering 4,096 characters per call, but in the end it apparently
leaves out the last 5,083 characters of this particular file (which is a few
MB long), so more than just the final callback is missing.

Also, when I test on another, smaller file of about 5,000 characters,
no callback seems to take place at all!

The MIME type is correctly detected as text/plain in both cases.

Any ideas?

Thanks!
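
One thing worth checking in the detection step above, offered as an assumption rather than a confirmed diagnosis: FileInputStream does not support mark/reset, so any code that peeks at it through a temporary buffering wrapper can pull bytes out of the raw stream that a later reader of the raw stream never sees. Whether tika.detect(is) behaves this way depends on the Tika version in use. The sketch below uses only JDK classes; noMark, peek and remaining are hypothetical helpers written for illustration and are not Tika APIs.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetSketch {

    /** Hypothetical stand-in for a stream that reports no mark support, like FileInputStream. */
    static InputStream noMark(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    /** Hypothetical stand-in for a detector: look at a few leading bytes, then rewind. */
    static void peek(InputStream in) throws IOException {
        in.mark(16);
        in.read(new byte[16]);
        in.reset();
    }

    /** Counts the bytes a later reader would still receive from the stream. */
    static int remaining(InputStream in) throws IOException {
        int count = 0;
        byte[] buf = new byte[1024];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            count += n;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000];

        // Case 1: a temporary wrapper is used only for peeking, then the
        // raw stream is read. The wrapper's buffer already holds the data.
        InputStream raw = noMark(data);
        peek(new BufferedInputStream(raw));
        System.out.println(remaining(raw));    // 0 with this in-memory data

        // Case 2: the same mark-capable wrapper is used for both steps.
        InputStream shared = new BufferedInputStream(noMark(data));
        peek(shared);
        System.out.println(remaining(shared)); // 5000
    }
}
```

If this turns out to be the cause, wrapping the FileInputStream in a BufferedInputStream once and passing that same wrapper to both the detection and the parse call would be one way to keep the stream contents intact.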

烟沫凡尘 2024-11-26 00:20:05

What version of Tika are you using? Looking at the source code, it reads the text in chunks of 4,096 characters, as can be seen on line 129 of TXTParser. At line 132 the characters(...) routine is invoked.

In short, the target code is:

   char[] buffer = new char[4096];
   int n = reader.read(buffer);
   while (n != -1) {
       xhtml.characters(buffer, 0, n);
       n = reader.read(buffer);
   }

where reader is a BufferedReader. I cannot see any flaw in this code, hence I'm thinking you might be working with an older version?
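
To check the loop quoted above in isolation, here is a minimal JDK-only sketch that drives a SAX handler from a StringReader using the same loop shape. It is not Tika's actual code: ChunkedReadSketch, CollectingHandler and pump are hypothetical names used for illustration. It also shows a detail that matters on the receiving side: only ch[start .. start+length) is valid in each callback, so a handler that converts the whole array with new String(ch) will also pick up stale characters left in the buffer whenever a chunk is shorter than 4,096.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedReadSketch {

    /** Hypothetical handler that collects only the valid region of each chunk. */
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int calls = 0;

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Only ch[start .. start+length) belongs to this callback; the
            // rest of the array may hold leftovers from an earlier read.
            text.append(ch, start, length);
            calls++;
        }
    }

    /** Same loop shape as the quoted snippet, driven by any Reader. */
    static void pump(Reader reader, ContentHandler handler) throws IOException, SAXException {
        char[] buffer = new char[4096];
        int n = reader.read(buffer);
        while (n != -1) {
            handler.characters(buffer, 0, n);
            n = reader.read(buffer);
        }
    }

    public static void main(String[] args) throws Exception {
        // Build 5,000 characters of input, similar in size to the small test file.
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            input.append((char) ('a' + i % 26));
        }

        CollectingHandler handler = new CollectingHandler();
        pump(new StringReader(input.toString()), handler);

        // Every character, including the final partial chunk, should arrive.
        System.out.println(handler.text.toString().equals(input.toString()));
    }
}
```

In a handler that prints instead of collecting, the equivalent change would be to print new String(ch, start, length) rather than new String(ch).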
