Extracting text from various document formats in Ruby on Rails
I'm new to Rails but am developing a web app that needs to take text from a large collection of document files and display it as HTML. The files are in .doc, .docx, .wps, and .pages formats and are currently just sitting on a hard drive. There are few enough .wps and .pages files that I could convert them to .doc manually, but the question remains: how do I get at the text inside a .doc or .docx file so that I can save it to a SQLite database for later use?
Thanks!
Comments (2)
Take a look at Yomu. It's a gem that wraps Apache Tika and supports a variety of document formats, including .doc and .docx.
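As a rough sketch of how that might look, the snippet below walks a directory, pulls the text out of each .doc/.docx file with `Yomu.new(path).text`, and returns it in a form that is easy to save. The names `extract_documents`, `YOMU_EXTRACTOR`, and `EXTENSIONS` are made up for illustration, and the `Document` model in the closing comment is a hypothetical ActiveRecord model, not something from the question. Yomu runs Tika under the hood, so this assumes Java is installed.

```ruby
# Default extractor: Yomu (a wrapper around Apache Tika; needs Java).
# The gem is required lazily, only when the extractor is actually called.
YOMU_EXTRACTOR = lambda do |path|
  require "yomu"
  Yomu.new(path).text
end

# File extensions to pick up (illustrative; extend as needed).
EXTENSIONS = %w[.doc .docx].freeze

# Walk `dir` recursively and return an array of { path:, body: } hashes,
# one per matching file, sorted by path.
def extract_documents(dir, extractor: YOMU_EXTRACTOR)
  paths = Dir.glob(File.join(dir, "**", "*")).select do |path|
    File.file?(path) && EXTENSIONS.include?(File.extname(path).downcase)
  end
  paths.sort.map { |path| { path: path, body: extractor.call(path).to_s.strip } }
end

# In a Rails app each hash could then be saved through a model, e.g.
# (Document is a hypothetical model with `path` and `body` columns):
#
#   extract_documents("/path/to/files").each { |attrs| Document.create!(attrs) }
```

Passing the extractor in as an argument is just a convenience: it lets you swap in another tool, or a stub in tests, without touching the directory-walking code.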
It's a long, roundabout way, but OpenOffice can convert files, and there are programmatic ways to do that: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/
That may not be the best approach, but maybe it will grease the wheels a bit.
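One way to script that kind of conversion from Ruby, as a sketch only: this is not the method from the linked article. It assumes LibreOffice (a fork of OpenOffice) is installed with its `soffice` binary on the PATH, and uses its headless `--convert-to` option to write a plain-text copy of each file. The method names `conversion_command` and `convert_to_text` are made up for illustration.

```ruby
# Build the argument list for a headless LibreOffice conversion to plain text.
# Assumption: the `soffice` binary is on the PATH.
def conversion_command(path, out_dir)
  ["soffice", "--headless", "--convert-to", "txt:Text", "--outdir", out_dir, path]
end

# Convert one document and return the converted text,
# or nil if the command fails or no output file was written.
def convert_to_text(path, out_dir)
  return nil unless system(*conversion_command(path, out_dir))

  txt_path = File.join(out_dir, File.basename(path, ".*") + ".txt")
  File.exist?(txt_path) ? File.read(txt_path) : nil
end
```

Passing the arguments to `system` as separate strings avoids the shell, so file names containing spaces or quotes need no escaping.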