Is there a Python module for converting RTF to plain text?

Posted on 2024-08-03 05:38:45

Comments (10)

浅听莫相离 2024-08-10 05:38:45

I've been working on a library called Pyth, which can do this:

http://pypi.python.org/pypi/pyth/

Converting an RTF file to plaintext looks something like this:

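from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter

# Parse the RTF file into Pyth's document model; the with block closes the file.
with open('sample.rtf') as rtf_file:
    doc = Rtf15Reader.read(rtf_file)

# The parenthesized form of print runs on both Python 2 and Python 3.
print(PlaintextWriter.write(doc).getvalue())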

Pyth can also generate RTF files, read and write XHTML, and generate documents from Python markup a la Nevow's stan, and it has limited experimental support for LaTeX and PDF output. Its RTF support is pretty robust: we use it in production to read RTF files generated by various versions of Word, OpenOffice, Mac TextEdit, EIOffice, and others.

江南月 2024-08-10 05:38:45

OpenOffice has an RTF reader. You can use Python to script OpenOffice; see here for more info.

You could probably try using the magic com-object on Windows to read anything that smells ms-binary. I wouldn't recommend that though.

Actually parsing the raw data probably won't be very hard, see this example written in .bat/QBasic.

DocFrac is a free, open source converter between RTF, HTML and text. It is available for Windows and Linux, and as an ActiveX control and a DLL. It will probably be pretty easy to wrap it up in Python.

RTF::TEXT::Converter - Perl extension for converting RTF into text (in case you have problems with DocFrac).

Official Rich Text Format (RTF) Specifications, version 1.7, by Microsoft.

Good luck (with the limited privileges in your working environment).

俏︾媚 2024-08-10 05:38:45

Have you checked out pyrtf-ng?

Update: The parsing functionality is available if you do a Subversion checkout, but I'm not sure how full-featured it is. (Look in the rtfng.parser.base module.)

信仰 2024-08-10 05:38:45

If you are on a Mac, you can convert an RTF file (here file.rtf) to TXT from the CLI like this:

textutil -convert txt file.rtf
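
Since the question asks for something usable from Python, the same command can be driven through subprocess. This is only a sketch: the helper names textutil_command and rtf_to_text are made up for illustration, it assumes textutil writes its result next to the input with a .txt extension, and it assumes that output is UTF-8.

```python
import shutil
import subprocess
from pathlib import Path


def textutil_command(rtf_path):
    """Build the argument list for the textutil call shown above."""
    return ["textutil", "-convert", "txt", str(rtf_path)]


def rtf_to_text(rtf_path):
    """Convert rtf_path with textutil and return the resulting text."""
    if shutil.which("textutil") is None:
        raise RuntimeError("textutil not found; it ships with macOS only")
    subprocess.run(textutil_command(rtf_path), check=True)
    # Assumption: the output lands beside the input as <name>.txt.
    txt_path = Path(rtf_path).with_suffix(".txt")
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling rtf_to_text("file.rtf") on a Mac should leave file.txt on disk and return its contents; on other systems it raises RuntimeError instead of failing inside subprocess.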

泼猴你往哪里跑 2024-08-10 05:38:45

Here's a link to a script that converts RTF to text using regex:
Regular Expression for extracting text from an RTF string

Also, an updated link on GitHub:
GitHub link
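
The linked script is not reproduced here. To give a rough idea of how a regex-based approach can work, the following is a much simplified sketch, not that script: a single pattern splits the RTF into tokens, and a small stack tracks which groups should be dropped. The names TOKEN, SKIPPED_DESTINATIONS and strip_rtf are illustrative, the default code page cp1252 is an assumption, and \uN Unicode escapes are not handled.

```python
import re

# One token per match: control word, \'hh hex escape, control symbol,
# brace, line break in the source, or a run of plain text.
TOKEN = re.compile(
    r"\\([A-Za-z]+)(-?\d+)? ?"   # control word, optional number, optional space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped byte
    r"|\\([^A-Za-z'])"           # control symbol such as \{ or \*
    r"|([{}])"                   # group open / close
    r"|[\r\n]+"                  # raw line breaks carry no meaning in RTF
    r"|([^\\{}\r\n]+)"           # plain text
)

# Groups starting with these control words hold metadata, not body text.
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}
BREAKS = {"par": "\n", "line": "\n", "tab": "\t"}


def strip_rtf(rtf, codepage="cp1252"):
    out = []
    stack = []        # saved "ignoring" flag for each open group
    ignoring = False  # True while inside a group whose text is dropped
    for m in TOKEN.finditer(rtf):
        word, _arg, hexbyte, symbol, brace, text = m.groups()
        if brace == "{":
            stack.append(ignoring)
        elif brace == "}":
            ignoring = stack.pop() if stack else False
        elif word:
            if word in SKIPPED_DESTINATIONS:
                ignoring = True
            elif not ignoring and word in BREAKS:
                out.append(BREAKS[word])
        elif symbol:
            if symbol == "*":          # \* marks an ignorable destination
                ignoring = True
            elif not ignoring and symbol in "\\{}":
                out.append(symbol)     # escaped literal character
        elif hexbyte and not ignoring:
            out.append(bytes.fromhex(hexbyte).decode(codepage, errors="replace"))
        elif text and not ignoring:
            out.append(text)
    return "".join(out)
```

For real documents a maintained library is likely the safer choice; the sketch mainly shows why the brace stack matters, since font tables and similar groups would otherwise leak into the output.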

不交电费瞎发啥光 2024-08-10 05:38:45

There is a good library, pyrtf-ng, for all-purpose RTF handling.

独留℉清风醉 2024-08-10 05:38:45

PyRTF-ng 0.9.1 failed to parse either of my RTF documents; both raised a ParsingException.
The first document was generated with OpenOffice 3.4, the second with Mac TextEdit.

Pyth 0.5.6 parsed both documents without problems, but did not handle Cyrillic characters properly.

But each editor opens the other editor's document correctly and without trouble, so all the libraries seem to have weak RTF support.

So I'm writing my own parser, with blackjack and hookers.

(I've uploaded both files, so you can check RTF libraries by yourself: http://yadi.sk/d/RMHawVdSD8O9 http://yadi.sk/d/RmUaSe5tD8OD)

把时间冻结 2024-08-10 05:38:45

I just came across pyrtflib. There's not much (any) documentation on it, so it's a case of installing it and then using the built-in help() function to find out what's available and what everything does.

Having said that, in my little trial run of its rtf.Rtf2Html.getHtml() function it went well enough. I haven't tried the Rtf2Txt function, but given that converting RTF to plain text is the simpler job, I'd expect it to do fine.

柏林苍穹下 2024-08-10 05:38:45

I ran into the same thing as I was trying to code it myself. It's not that easy, but here is what I had when I decided to go for a command-line app instead. It's Ruby, but you can adapt it to Python very easily.
There is some header garbage to clean up, but you can more or less see the idea.

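f = File.open('r.rtf', 'r')
b = 0        # current brace nesting depth
p = false    # true while inside a control word (after a backslash)
str = ''
begin
  while (char = f.readchar)
    if char.chr == '{'
      b += 1
      next
    end
    if char.chr == '}'
      b -= 1
      next
    end
    if char.chr == '\\'
      p = true
      next
    end
    # Whitespace ends a control word. Double quotes are required here:
    # in single quotes, '\n' is a backslash followed by "n", not a newline.
    if p && [' ', "\n", "\t", "\r"].include?(char.chr)
      p = false
      next
    end
    if p && char.chr == "'"
      # This is the source of my headaches. You need to read the code page
      # from the header and decode this escape. For now a '#' placeholder is
      # written, and the two hex digits that follow are copied through as text.
      p = false
      str << '#'
      next
    end
    next if b > 2   # skip text nested more than two groups deep
    next if p       # skip the characters of a control word
    str << char.chr
  end
rescue EOFError
  # readchar raises EOFError at end of file; nothing more to do.
end
f.close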
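
The comment in the Ruby code points at the hard part: the \'hh escapes are bytes in the document's code page, which the header announces with \ansicpg. A hedged Python sketch of just that step follows. The function name decode_hex_escapes is made up for illustration, the fallback to code page 1252 is an assumption, and per-font charsets (\fcharset), which can override the document code page, are ignored.

```python
import codecs
import re


def decode_hex_escapes(rtf, default_codepage=1252):
    r"""Replace runs of \'hh escapes with the characters they encode."""
    # The header normally announces the ANSI code page, e.g. \ansicpg1251.
    found = re.search(r"\\ansicpg(\d+)", rtf)
    codec = "cp%d" % (int(found.group(1)) if found else default_codepage)
    try:
        codecs.lookup(codec)
    except LookupError:
        # Python has no codec under that name; fall back to the default.
        codec = "cp%d" % default_codepage

    def replace(match):
        # Strip the \' prefixes so only the hex digits remain, then decode
        # the whole run at once so multi-byte code pages stay intact.
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codec, errors="replace")

    return re.sub(r"(?:\\'[0-9a-fA-F]{2})+", replace, rtf)
```

Running this over the raw RTF before stripping control words would turn the '#' placeholders into real characters, which is also the step that matters for Cyrillic documents.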

星星的軌跡 2024-08-10 05:38:45

Conversely, if you want to write RTFs easily from Python, you can use the third-party module rtflib. It's a fairly new and incomplete module but still very powerful and useful. Below is an example that writes "hello world" in rich text to an RTF called helloworld.rtf. This is a very primitive example, and the module can also be used to add colors, italics, tables, and many other aspects of rich text to RTF files.

from rtflib import *
file = RTF("helloworld.rtf")
file.startfile()
file.addstrict()
file.addtext("hello world")
file.writeout()