Searching attachments (Word, PDF, Excel, etc.) from a Rails application

Posted on 2024-12-09 04:35:33

My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture of PDF, Word, Excel and HTML.

I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:

So I'm left with two options:

  1. Pick a different search tool
  2. Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read

Which approach do you recommend?

If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!

If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?

Thanks everyone. Really appreciate your help.


Comments (2)

夏日落 2024-12-16 04:35:33

Well, I have not done binary file indexing before, but apparently Solr has support for it; see "Indexing files with SPHINX/ultrasphinx" and http://wiki.apache.org/solr/ExtractingRequestHandler

There are quite a few gems available for Solr, and Sunspot (http://outoftime.github.com/sunspot/) seems to be a popular one. Although Sunspot does not seem to have built-in support for Solr Cell, there is some work going into it at https://github.com/tomasc/sunspot_cell

There are probably better options out there, but this should give you a good starting point.

伴随着你 2024-12-16 04:35:33

Just to update this. The approach I've decided to go with is:

Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read

Specifically, I'll be doing the following:

  • Using thinking-sphinx
  • Using the subexec gem to call Tika from the command line

It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file] but I'll post my experiences if it turns out to be more complicated!
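
For anyone following the same route, here is a minimal sketch of what that call might look like from Ruby. It is not the code I ended up with: it uses Open3 from the standard library in place of the subexec gem, and the names `TIKA_JAR`, `tika_command`, `run_extractor` and `extract_text` are made up for illustration. It assumes `java` is on the PATH and that the Tika jar sits wherever `TIKA_JAR` points.

```ruby
require "open3"

# Hypothetical jar location; override with the TIKA_JAR environment variable.
TIKA_JAR = ENV.fetch("TIKA_JAR", "tika-app-0.10.jar")

# Build the argument list for Tika's plain-text mode (-t), as in the
# command quoted above. Passing an array avoids shell quoting problems
# with file names that contain spaces.
def tika_command(path, jar: TIKA_JAR)
  ["java", "-jar", jar, "-t", path.to_s]
end

# Run a command without a shell and return its standard output.
# Raises if the process exits with a non-zero status.
def run_extractor(command)
  stdout, stderr, status = Open3.capture3(*command)
  unless status.success?
    raise "extraction failed (exit #{status.exitstatus}): #{stderr.strip}"
  end
  stdout
end

# Extract the plain text of one attachment.
def extract_text(path, jar: TIKA_JAR)
  run_extractor(tika_command(path, jar: jar))
end
```

The returned string could then be saved into a text column on the attachment's model (for example from a callback when the file is uploaded), and that column is what thinking-sphinx would index. The column and the callback are assumptions about how the app is laid out, not something Tika or thinking-sphinx requires.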
