Why can't the blue quote lines be maintained properly when converting RTF to HTML?
I saved an Outlook e-mail reply I was working on as an RTF document, which I've uploaded.
What I'd like to do is convert this RTF document to HTML. I've tried various means: LibreOffice, various conversion utilities, and of course Microsoft Word. Most of the markup is converted fine, but there seems to be something 'magical' about the blue quote lines down the left. I just can't get them to be converted accurately.
Most conversion utilities just drop them altogether. As for Microsoft Word: when I open the file initially, it looks fine (inline replies have no blue quote line, quoted text does). However, when I save it to HTML in Word and then open that HTML file, the blue quote line is retained up until the first reply ("Indeed it is."), and after that it disappears. Why are the remaining parts of the blue quote line being destroyed in the conversion process, and how can I get them to stay there?
By the way, the exact same problem happens if I instead save the Outlook e-mail in DOCX format, open that in Word, and save it as HTML. There seems to be something proprietary and/or esoteric about the way those quote lines are implemented. See below for screenshots of what it should look like (i.e. after I initially open it in Word), and what it does look like (i.e. after it's been saved to HTML format).
Should look like:
Does look like:
Comments (1)
OK, I've been experimenting with the DOCX version of this saved e-mail (I saved it in both RTF and DOCX format), and I've found and remedied the problem with that. I'm guessing the same problem somehow made its way into the RTF version of the file, perhaps because the way Microsoft implements the blue quoteline in the RTF is just to use some proprietary RTF extension that stores the necessary extra styling data that would have been stored in the DOCX anyway - that would explain why I lose the quoteline when I use anything other than MS Word to open said RTF. As RTF is a rather ugly format and I find DOCX a lot easier to work with, I'll describe my DOCX fix below.
The problem with the DOCX was this: Word defines a bunch of paragraphs in the document.xml part of an Outlook-format document package, links some of them to divIds, and then defines a separate websettings.xml part to go along with it. If you break up the blue quoteline in Outlook by pressing Ctrl+Q, as I did to create this DOCX, Word tags each of the paragraphs to be prefixed with a blue quoteline with the same divId, and then just has that one divId defined in websettings.xml; so, you get something like this in document.xml (I've formatted it a bit more nicely than the one long string you get from MS Word):
... and something like this in websettings.xml (formatting made prettier again):
So, the one w:div defined in websettings.xml is being referenced multiple times in document.xml. Now, although this seems to work fine when you open the file as a DOCX in MS Word, it becomes a major problem when you want to convert the document to HTML. It looks like an XSLT transformation is being applied to document.xml, and because in XML there should only ever be a single element in a document with a particular ID, the transformation only applies the websettings.xml styling to the first paragraph in document.xml with a divId of 1800686860. In my example, that happens to be the paragraph containing the header information and first line ("From: Joe Bloggs [...] This is an initial e-mail."). The remaining paragraphs with that divId DON'T receive the styling in websettings.xml.
Because it's the styling for a divId of 1800686860 in websettings.xml that causes the blue quoteline to appear on the left, the remaining paragraphs that we want to receive the quoteline don't receive it, because the styling isn't applied to any of the remaining paragraphs! In my opinion this is a nasty bug in MS Word - that it allows itself to generate XML like this that causes a broken HTML transform.
The fix? Find all paragraphs in document.xml with duplicate divIds. Make a note of them. Then, for each divId with duplicates, create a copy of its w:div element in websettings.xml and assign the copy a new, unique ID, for each duplicate instance in document.xml. Then, change each duplicate ID in document.xml to one of the copies. Once those changes are made (so each paragraph is genuinely linked to a separate, unique w:div in websettings.xml), and you save the modified DOCX as an HTML file in Word... it works! The generated HTML file looks pretty much identical to the DOCX, blue quotelines included.
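For anyone who would rather script the fix than edit the XML by hand, the steps above can be sketched in Python roughly as follows. This is only a sketch under several assumptions, not the procedure that was actually followed: it assumes the parts are stored as word/document.xml and word/webSettings.xml, that Word serializes the references as <w:divId w:val="..."/> and the definitions as <w:div w:id="...">, and that small integers counted up from 1 are acceptable as new IDs. The function names (find_div, make_div_ids_unique, patch_docx) are made up for illustration. It edits the XML as text rather than re-serializing it, so everything else in the parts is left byte-for-byte as Word wrote it.

```python
import re
import zipfile
from itertools import count

# Assumed part names and serialization; real packages may differ.
DOC_PART = "word/document.xml"
WEB_PART = "word/webSettings.xml"
DIV_ID_REF = re.compile(r'<w:divId\s+w:val="(-?\d+)"\s*/>')   # reference in a paragraph
DIV_DEF = re.compile(r'<w:div\s+w:id="(-?\d+)"')              # definition opening tag
DIV_TAG = re.compile(r'<w:div(?:\s[^>]*)?>|</w:div>')         # w:div only, not w:divs/w:divBdr
NESTED = re.compile(r'<w:divsChild>.*</w:divsChild>', re.S)


def find_div(web_xml, div_id):
    """Return (start, end) offsets of the w:div element with this id, or None."""
    opening = re.search(r'<w:div\s+w:id="%s"' % re.escape(div_id), web_xml)
    if opening is None:
        return None
    depth = 0
    for tag in DIV_TAG.finditer(web_xml, opening.start()):
        if tag.group().startswith("</"):
            depth -= 1
        elif not tag.group().endswith("/>"):
            depth += 1
        if depth == 0:
            return opening.start(), tag.end()
    return None


def make_div_ids_unique(document_xml, web_xml):
    """Give every repeated w:divId reference its own cloned w:div definition."""
    defined = set(DIV_DEF.findall(web_xml))
    used = set(defined)
    candidates = count(1)
    seen = set()
    clones = {}  # original id -> new ids handed out to its repeated references

    def relink(match):
        old = match.group(1)
        if old not in defined:
            return match.group(0)      # nothing to clone; leave untouched
        if old not in seen:
            seen.add(old)              # first reference keeps the original id
            return match.group(0)
        new = next(str(n) for n in candidates if str(n) not in used)
        used.add(new)
        clones.setdefault(old, []).append(new)
        return '<w:divId w:val="%s"/>' % new

    document_xml = DIV_ID_REF.sub(relink, document_xml)

    for old, new_ids in clones.items():
        start, end = find_div(web_xml, old)
        # Assumption: clones drop nested w:divsChild so no nested id is repeated.
        template = NESTED.sub("", web_xml[start:end])
        copies = "".join(
            template.replace('w:id="%s"' % old, 'w:id="%s"' % new, 1)
            for new in new_ids
        )
        web_xml = web_xml[:end] + copies + web_xml[end:]  # clones become siblings
    return document_xml, web_xml


def patch_docx(src_path, dst_path):
    """Copy a DOCX, rewriting only the two parts that carry the div ids."""
    with zipfile.ZipFile(src_path) as src:
        parts = {info.filename: src.read(info.filename) for info in src.infolist()}
    doc, web = make_div_ids_unique(
        parts[DOC_PART].decode("utf-8"), parts[WEB_PART].decode("utf-8")
    )
    parts[DOC_PART] = doc.encode("utf-8")
    parts[WEB_PART] = web.encode("utf-8")
    with zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name, data in parts.items():
            dst.writestr(name, data)
```

Calling patch_docx("reply.docx", "reply-fixed.docx") (both file names are placeholders) would write a copy in which each paragraph points at its own w:div; that copy can then be opened in Word and saved as HTML. Work on a copy of the file, since the sketch has no handling for packages that deviate from the assumptions listed above.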