Searching attachments (Word, PDF, Excel, etc.) from a Rails application
My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture of PDF, Word, Excel and HTML.
I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:
- Texticle requires PostgreSQL. I'm on MySQL.
- thinking-sphinx doesn't index files on the file system.
- even if I saved my attachments into the database, thinking-sphinx still wouldn't work as it requires plain text (according to http://groups.google.com/group/thinking-sphinx/browse_thread/thread/69cdc1c8e1c096ff)
So I'm left with two options:
- Pick a different search tool
- Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read
Which approach do you recommend?
If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!
If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?
Thanks everyone. Really appreciate your help.
Comments (2)
Well, I have not done binary file indexing before, but apparently Solr has support for it; see "Indexing files with SPHINX/ultrasphinx" and http://wiki.apache.org/solr/ExtractingRequestHandler
There are quite a few gems available for Solr, and Sunspot seems to be a popular one (http://outoftime.github.com/sunspot/). Sunspot does not seem to have built-in support for Solr Cell, but there appears to be some work going into it (https://github.com/tomasc/sunspot_cell). There are probably better options out there, but this should give you a good starting point.
Just to update this. The approach I've decided to go with is:
Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read
Specifically, I'll be doing the following:
It looks as if it will be as simple as calling
java -jar tika-app-0.10.jar -t [file]
but I'll post my experiences if it turns out to be more complicated!
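For anyone who wants to try the same thing, here is a minimal sketch of how that Tika call might be wrapped from Ruby. It is not a tested integration: the jar path and the method names `tika_command` and `extract_text` are made up for illustration, and it assumes `java` is on the PATH and that you are on Ruby 1.9 or later (for `Open3.capture3`).

```ruby
require "open3"

# Hypothetical location of the Tika jar; adjust to wherever it is installed.
TIKA_JAR = "tika-app-0.10.jar"

# Builds the argument list for the command quoted above:
#   java -jar tika-app-0.10.jar -t [file]
def tika_command(path, jar = TIKA_JAR)
  ["java", "-jar", jar, "-t", path]
end

# Runs Tika on one file and returns the extracted plain text.
# Passing the arguments as an array avoids shell interpolation,
# so filenames containing spaces or quotes are handled safely.
def extract_text(path, jar = TIKA_JAR)
  stdout, stderr, status = Open3.capture3(*tika_command(path, jar))
  raise "Tika failed for #{path}: #{stderr}" unless status.success?
  stdout
end
```

The returned string could then be saved into a text column on the attachment's model (whatever you choose to call it) so that thinking-sphinx can index it like any other field.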