Lucene.NET, Azure Blob storage, and IFilter
What would be the best way to use IFilter to extract text from PDF, Word, and other document formats in an Azure solution?
I've seen IFilter examples that take a stream, but what should the stream contain?
Should it include some sort of OLE headers or similar wrapping?
Sending the raw file content to IFilter as a stream doesn't seem to work.
Or would it be better to save the files to local file storage and let IFilter read them from there?
Comments (1)
Using IFilter in Azure will be tricky because several of the IFilters that are common on a desktop aren't available in an Azure web/worker role.
You could create a persistent VM in Azure and install the missing IFilters.
However, if you're going to build your Lucene index from web uploads, you could extract the text from each file as it is uploaded, index that text, and save the original file separately. Add a field to the index that lets you get back to the original source document.
There might be an easier way, but that's how I solved the same issue.
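To make the upload-time approach concrete, here is a rough, library-agnostic sketch in Python. Everything in it is a hypothetical stand-in rather than real Lucene.NET or Azure code: `BlobStore` plays the role of a blob container, `TextIndex` is a toy inverted index, and `extract_text` assumes plain UTF-8 input where a real system would call a PDF/Word extractor. The point is only the shape of the flow: extract text, index it, store the original separately, and keep the blob name as a stored field on the index entry.

```python
import re
from collections import defaultdict


class BlobStore:
    """Hypothetical in-memory stand-in for a blob container."""

    def __init__(self):
        self._blobs = {}

    def put(self, name, data):
        self._blobs[name] = data

    def get(self, name):
        return self._blobs[name]


class TextIndex:
    """Toy inverted index; each entry carries a stored 'blob_name' field."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of document ids
        self._stored = {}                  # document id -> stored fields

    def add(self, text, blob_name):
        doc_id = len(self._stored)
        # The stored field is what leads back to the original file.
        self._stored[doc_id] = {"blob_name": blob_name}
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(doc_id)
        return doc_id

    def search(self, term):
        ids = sorted(self._postings.get(term.lower(), ()))
        return [self._stored[i]["blob_name"] for i in ids]


def extract_text(data):
    """Placeholder extractor (assumption): treats the upload as UTF-8 text.
    A real pipeline would parse PDF/Word content here instead."""
    return data.decode("utf-8", errors="ignore")


def handle_upload(name, data, blobs, index):
    """Extract text at upload time, index it, and save the original separately."""
    text = extract_text(data)
    blobs.put(name, data)
    index.add(text, blob_name=name)


blobs = BlobStore()
index = TextIndex()
handle_upload("docs/report.txt", b"Quarterly revenue grew", blobs, index)
handle_upload("docs/notes.txt", b"Meeting notes about revenue", blobs, index)

hits = index.search("revenue")   # blob names of matching documents
original = blobs.get(hits[0])    # fetch the original bytes via the stored field
```

Because only the extracted text and a blob name go into the index, the index stays small and the extraction step can be swapped out without touching the stored originals.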