Apache Tika and file access instead of Java input streams

Posted on 2024-11-07 21:03:22

I want to create a new Tika parser to extract metadata from a file. We already use Tika, and routing this through it would keep our metadata extraction consistent.

I think that I've run into this problem/enhancement request for Tika:

Allow passing of files or memory buffers to parsers

I have a console C++ executable that takes a file path as input and prints the metadata it finds, one name/value pair per line. The C++ code relies on libraries that expect a file path when accessing the data, and rewriting the executable in Java is not an option.

I thought it would be fairly easy to plug this into Tika. But a Tika parser has to be written in Java, and the parser method that must be overridden takes an open input stream:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

So I guess my only option is to write the input stream to a temporary file, process that file, and then clean it up. I would rather not deal with temporary files, or have to worry about stray files being left behind if something goes wrong before they are deleted.

Does anyone have a clever idea about how to cleanly deal with something like this?

Comments (2)

维持三分热 2024-11-14 21:03:22

TikaInputStream should help. It can wrap either a File or an InputStream, converts between the two internally as parsers require, and handles all the temp file work for you.

Several Java parsers already use it because they need a File rather than an InputStream. What's more, users who have a file can pass it to the parser wrapped as an InputStream, and the parser can read it as either a File or an InputStream, whichever suits its needs.

So I'd suggest you turn the InputStream into a TikaInputStream (which is just a cast if it already is one), then get the file and pass its path to your C++ executable.
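To make the temp file handling concrete, here is a minimal JDK-only sketch of the spool-then-clean-up pattern. It is not Tika's implementation, and the names `TempFileSpool`, `PathAction` and `withTempFile` are made up for illustration. Inside a real parser, calling `TikaInputStream.get(stream)` and then `getFile()` would take the place of the manual copy.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileSpool {

    /** Hypothetical callback for code that needs a real path on disk. */
    @FunctionalInterface
    public interface PathAction<T> {
        T apply(Path path) throws IOException;
    }

    /**
     * Copies the stream to a temporary file, runs the action on that file,
     * and deletes the file afterwards even if the action throws.
     */
    public static <T> T withTempFile(InputStream stream, PathAction<T> action)
            throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        // Files::size stands in for handing the path to the external executable.
        long size = withTempFile(in, Files::size);
        System.out.println("spooled " + size + " bytes");
    }
}
```

Keeping the delete in a `finally` block means the file is removed on both the normal and the exceptional path, which addresses the cleanup worry raised in the question. It cannot help if the JVM itself is killed mid-run.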

嗫嚅 2024-11-14 21:03:22

If I understand correctly, and assuming you're launching the C++ program using Runtime.exec, you could parse the Process's standard output stream as the InputStream that Tika wants. Would that work?
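Reading the child process's standard output might look roughly like the sketch below. Several things here are assumptions: the `name=value` line format (the question only says "name/value pairs"), the class and method names, and the executable path passed to `run`. `ProcessBuilder` is used in place of `Runtime.exec`, since it makes redirecting stderr straightforward.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class NameValueOutput {

    /** Reads "name=value" lines; lines without a name before '=' are skipped. */
    public static Map<String, String> parse(InputStream out) throws IOException {
        Map<String, String> pairs = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    pairs.put(line.substring(0, eq).trim(),
                              line.substring(eq + 1).trim());
                }
            }
        }
        return pairs;
    }

    /** Launches the (hypothetical) executable on a file and parses its stdout. */
    public static Map<String, String> run(String executable, Path file)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(executable, file.toString())
                // Send the child's stderr to our stderr so it cannot fill up and block.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        Map<String, String> pairs = parse(process.getInputStream());
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("extractor exited with code " + exit);
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample standing in for the executable's output.
        String sample = "title=Report\nauthor=Jane Doe\n";
        System.out.println(parse(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Inside a Tika parser, each parsed pair could then be copied into the `Metadata` object passed to `parse`. The executable still needs a file path as its argument, so this complements the temp file step and does not replace it.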
