Solr Cell / ExtractingRequestHandler fails to parse some *.doc files

Posted on 2024-11-15 17:07:55

I need to index the content of doc/docx/pdf files uploaded by users, and I use the Solr (1.4.1) ExtractingRequestHandler component (817165) for that. In case it matters, I don't ask it to do the indexing: the component is always called with the extractOnly parameter, so it only returns the text content of the document and does not add it to the index itself (the content is then added to the index "outside", as a text field of the document, following the standard procedure).
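For readers who want to reproduce the setup, here is a minimal Python sketch of such an extract-only call. It assumes that Solr listens on localhost:8983 and that the handler is mapped to /update/extract; both are assumptions about the deployment, and the function names are made up for illustration. Only the extractOnly parameter comes from the description above.

```python
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: host, port and handler path; adjust to your own solrconfig.xml.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_request(path, base_url=SOLR_EXTRACT_URL):
    """Build a POST request asking Solr to extract text without indexing it."""
    # extractOnly=true makes the handler return the extracted content
    # instead of adding the document to the index.
    params = urllib.parse.urlencode({"extractOnly": "true"})
    return urllib.request.Request(
        f"{base_url}?{params}",
        data=Path(path).read_bytes(),  # raw file bytes as the request body
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def extract_text(path, base_url=SOLR_EXTRACT_URL, timeout=60):
    """Send the request and return the raw response body as a string.

    urllib raises urllib.error.HTTPError when Solr answers with a 500.
    """
    request = build_extract_request(path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The response body still has to be parsed to pull out the text; that step is left out here because it depends on the response writer configured for the handler.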

However, some files are not parsed, and the component returns a 500 Internal Server Error with no other details. About 30% of all *.doc files submitted by our users fail to parse.

It is not a problem with Solr load: the files that cannot be parsed are always the same if you parse the same list again and again. It is also not about their size: many of them are smaller than others that parse successfully. Apparently it is not about peculiar formatting either (or at least not obviously): almost all documents that fail to parse have coloured fonts, tables and images, but so do many of the ones that parse successfully.

All these files open in Word without any warnings or errors. If you save them as docx, Solr starts parsing them correctly, but re-saving them in the same doc format with the same content doesn't help. Still, if all the content is removed and replaced by some lorem ipsum text and the file is then saved as doc, it parses correctly.

Since replacing the content helps, the cause must be some element used in the documents, but the Tika Formats page does not describe the cases in which parsing of a document fails.

I've uploaded a sample file that fails to parse, in case anyone is curious enough to try it (it is archived to prevent Windows Live from converting it into an "online document").

Currently, as a workaround, I use the ancient antiword utility to parse those *.doc files on which Solr fails (and antiword parses them perfectly). Still, it is obviously a crutch, and I wonder if anybody else is facing the same issue. I failed to google it, so probably I'm doing something wrong.
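The workaround can be sketched roughly as follows in Python. This is an illustration only, not the code actually in use: the function names are hypothetical, `primary` stands for whatever callable performs the Solr extraction, and antiword is assumed to be installed and on the PATH.

```python
import subprocess


def antiword_extract(path):
    """Run the antiword CLI, which prints the document text to stdout."""
    result = subprocess.run(["antiword", path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_fallback(path, primary, fallback=antiword_extract):
    """Try primary(path); if it raises, use fallback(path) instead.

    Returns (text, source) so the caller can record which extractor was used.
    """
    try:
        return primary(path), "primary"
    except Exception as exc:  # e.g. an HTTP 500 raised by the Solr call
        print(f"primary extractor failed for {path}: {exc}")
        return fallback(path), "fallback"
```

Recording which extractor produced the text makes it easy to count how many files still need the fallback after an upgrade.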

Or, if that's a known problem, what would be a more elegant way to solve it (I don't like relying on antiword)?

Comments (1)

鯉魚旗 2024-11-22 17:07:55

I'd try upgrading Tika if I were you.

I've taken your sample file and tried it with the latest version of Tika. Extracting to text works just fine. I see:

LOREM IPSUM
Lorem ipsum dolor sit amet
------

Home Phone:           000000000

Work   :   00000000           

(etc)

So I suspect it's an issue with older versions of POI+Tika that has since been fixed.

(If you're using a custom-built copy of SOLR, then you may just need to bump the Tika dependency in the pom and rebuild, and maven will take care of it for you. Otherwise, a newer SOLR should ship with a newer Tika as standard.)
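Before rebuilding anything, the same check can be repeated against a standalone Tika. The Python sketch below wraps the Tika app's `--text` option; the jar path is a placeholder for whichever tika-app jar you download, and the function names are made up for illustration.

```python
import subprocess


def tika_text_command(jar_path, doc_path):
    """Command line asking the Tika app to print the plain text of a file."""
    return ["java", "-jar", jar_path, "--text", doc_path]


def tika_extract(jar_path, doc_path):
    """Run the Tika app and return what it prints to stdout.

    check=True raises CalledProcessError if Tika exits with an error,
    which is the signal that this Tika version cannot parse the file.
    """
    result = subprocess.run(
        tika_text_command(jar_path, doc_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Running this over the list of failing files with the new jar shows whether an upgrade alone would cover them.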
