How can I count words in complex documents (.rtf, .doc, .odt, etc.)?
I'm trying to write a Python function that, given the path to a document file, returns the number of words in that document. This is fairly easy to do with .txt files, and there are tools that allow me to hack support for a few more complex document formats together, but I want a really comprehensive solution.
Looking at OpenOffice.org's py-uno scripting interface and list of supported formats, it would seem ideal to load the documents in a headless OOo and call its word-count function. However, I can't find any py-uno tutorials or sample code that go beyond basic document generation, and even the code snippets I have found are out of date by a half-decade and no longer work.
Whether by using OOo and Uno or not, how can I get reliable word-counts for documents of various formats?
Comments (2)
"load the documents in a headless OOo and call its word-count function"
PyODConverter is a recent (November 2009) script that uses OOo to convert between multiple file types. Judging from the script, it handles basic loading of all the document formats OOo supports.
This is how you start OOo as a headless service:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
Then you just have to write a small bootstrapper that launches OOo from the command line, runs your script, and then closes OOo.
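A rough sketch of such a bootstrapper might look like the following. It is untested against any particular OOo release and rests on several assumptions: that `soffice` is on the PATH, that the script runs under a Python that can import OOo's `uno` module, and that the loaded document is a text document exposing the `WordCount` property (spreadsheets and presentations would need different handling). The function names `soffice_command` and `count_words_via_ooo` and the fixed startup delay are made up for illustration.

```python
import os
import subprocess
import time


def soffice_command(port=8100, host="127.0.0.1"):
    """Build the headless-service command line shown above."""
    accept = "socket,host=%s,port=%d;urp;" % (host, port)
    return ["soffice", "-headless", "-accept=" + accept, "-nofirststartwizard"]


def count_words_via_ooo(path, port=8100, host="127.0.0.1", startup_wait=5.0):
    """Start OOo, load the document hidden, read its word count, shut OOo down."""
    # Imported lazily: the uno module normally ships with OOo's own Python.
    import uno
    from com.sun.star.beans import PropertyValue

    office = subprocess.Popen(soffice_command(port, host))
    try:
        # Crude fixed wait (assumption); retrying resolve() would be more robust.
        time.sleep(startup_wait)

        local_ctx = uno.getComponentContext()
        resolver = local_ctx.ServiceManager.createInstanceWithContext(
            "com.sun.star.bridge.UnoUrlResolver", local_ctx)
        ctx = resolver.resolve(
            "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext"
            % (host, port))
        desktop = ctx.ServiceManager.createInstanceWithContext(
            "com.sun.star.frame.Desktop", ctx)

        hidden = PropertyValue()
        hidden.Name = "Hidden"
        hidden.Value = True
        url = uno.systemPathToFileUrl(os.path.abspath(path))
        doc = desktop.loadComponentFromURL(url, "_blank", 0, (hidden,))
        try:
            # Assumption: only text documents expose WordCount.
            return doc.WordCount
        finally:
            doc.close(True)
    finally:
        # soffice may be a launcher script, in which case this might not
        # stop the real office process on every platform.
        office.terminate()
```

Calling `count_words_via_ooo("/path/to/report.doc")` would then return OOo's own count for that file, which should match what the Writer word-count dialog reports.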
This may not be an option for you, but if it is, you can upload the documents to Google Docs and then export them in .txt format. Google usually does a very nice job with the conversion.
You can find relevant APIs here: http://code.google.com/intl/pl/apis/documents/docs/1.0/developers_guide_python.html
Take a look at the login, uploading and exporting sections.
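Once a document has been exported to .txt, counting is the easy case mentioned in the question. A minimal sketch, assuming that a "word" is any whitespace-delimited token and that the export is UTF-8 (both are assumptions, and the helper names are made up for illustration):

```python
import io


def count_words_in_text(text):
    """Count whitespace-delimited tokens in a string."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    return len(text.split())


def count_words_in_txt(path, encoding="utf-8"):
    """Count words in a plain-text file, e.g. a .txt export."""
    with io.open(path, "r", encoding=encoding) as handle:
        return count_words_in_text(handle.read())
```

This definition will not always agree with a word processor's count: hyphenated words, punctuation standing alone, and languages written without spaces are all handled differently by different tools.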