Why aren't blue quotelines properly maintained when converting RTF to HTML?

Posted on 2024-12-07 23:19:34

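I saved an Outlook e-mail reply that I was working on as an RTF document, which I've uploaded.

What I want to do is convert this RTF document to HTML. I've tried a variety of approaches - LibreOffice, various conversion utilities and, of course, Microsoft Word. Most of the markup converts fine, but there seems to be something "magic" about the blue quotelines on the left. I just can't get them to convert accurately.

Most conversion utilities simply drop them altogether. As for Microsoft Word: when I first open the file, it looks fine (the inline replies have no blue quoteline, the quoted text does). But when I save it as HTML in Word and then open that HTML file, the blue quoteline is maintained up to the first reply ("Indeed.") and then disappears. Why is the rest of the blue quoteline destroyed in the conversion, and how can I make it stay?

Incidentally, exactly the same problem occurs if I save the Outlook e-mail in DOCX format, open it in Word, and then save it as HTML. There seems to be something proprietary and/or esoteric about how these quotelines are implemented. See the screenshots below for what it should look like (i.e., after I initially open it in Word) and what it actually looks like (i.e., after saving it in HTML format).

What it should look like:
[screenshot]

What it actually looks like:
[screenshot]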

老子叫无熙 2024-12-14 23:19:34

OK, I've been experimenting with the DOCX version of this saved e-mail (I saved it in both RTF and DOCX format), and I've found and remedied the problem with that. I'm guessing the same problem somehow made its way into the RTF version of the file, perhaps because the way Microsoft implements the blue quoteline in the RTF is just to use some proprietary RTF extension that stores the necessary extra styling data that would have been stored in the DOCX anyway - that would explain why I lose the quoteline when I use anything other than MS Word to open said RTF. As RTF is a rather ugly format and I find DOCX a lot easier to work with, I'll describe my DOCX fix below.

The problem with the DOCX was this: Word defines a bunch of paragraphs in the document.xml part of an Outlook-format document package, links some of them to divIds, and then defines a separate websettings.xml part to go along with it. If you break up the blue quoteline in Outlook by pressing Ctrl+Q, as I did to create this DOCX, Word tags each of the paragraphs to be prefixed with a blue quoteline with the same divId, and then just has that one divId defined in websettings.xml; so, you get something like this in document.xml (I've formatted it a bit more nicely than the one long string you get from MS Word):

<w:p w:rsidR="00ED60D7" w:rsidRPr="007B768D" w:rsidRDefault="00ED60D7" w:rsidP="007B768D">
    <w:pPr>
        <w:divId w:val="1800686860"/>
    </w:pPr>
    <w:r w:rsidRPr="007B768D">
        <w:t>Let's do some inline quoting when replying to it.</w:t>
    </w:r>
</w:p>

[...]

<w:p w:rsidR="00ED60D7" w:rsidRPr="007B768D" w:rsidRDefault="00ED60D7" w:rsidP="007B768D">
    <w:pPr>
        <w:divId w:val="1800686860"/>
    </w:pPr>
    <w:r w:rsidRPr="007B768D">
        <w:t>Best regards,</w:t>
    </w:r>
</w:p>

... and something like this in websettings.xml (formatting made prettier again):

<w:div w:id="1800686860">
    <w:marLeft w:val="0"/>
    <w:marRight w:val="0"/>
    <w:marTop w:val="0"/>
    <w:marBottom w:val="0"/>
    <w:divBdr>
        <w:top w:val="none" w:sz="0" w:space="0" w:color="auto"/>
        <w:left w:val="single" w:sz="12" w:space="4" w:color="0000FF"/>
        <w:bottom w:val="none" w:sz="0" w:space="0" w:color="auto"/>
        <w:right w:val="none" w:sz="0" w:space="0" w:color="auto"/>
    </w:divBdr>
    <w:divsChild>
        <w:div w:id="1800686861">
            <w:marLeft w:val="0"/>
            <w:marRight w:val="0"/>
            <w:marTop w:val="0"/>
            <w:marBottom w:val="0"/>
            <w:divBdr>
                <w:top w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                <w:left w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                <w:bottom w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                <w:right w:val="none" w:sz="0" w:space="0" w:color="auto"/>
            </w:divBdr>
            <w:divsChild>
                <w:div w:id="1800686862">
                    <w:marLeft w:val="0"/>
                    <w:marRight w:val="0"/>
                    <w:marTop w:val="0"/>
                    <w:marBottom w:val="0"/>
                    <w:divBdr>
                        <w:top w:val="single" w:sz="8" w:space="3" w:color="B5C4DF"/>
                        <w:left w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                        <w:bottom w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                        <w:right w:val="none" w:sz="0" w:space="0" w:color="auto"/>
                    </w:divBdr>
                </w:div>
            </w:divsChild>
        </w:div>
    </w:divsChild>
</w:div>

So, the one w:div defined in websettings.xml is being referenced multiple times in document.xml. Although this seems to work fine when you open the file as a DOCX in MS Word, it becomes a major problem when you want to convert the document to HTML. It looks like an XSLT transformation is being applied to document.xml and, because an XML document should only ever contain a single element with a particular ID, the transformation only applies the websettings.xml styling to the first paragraph in document.xml with a divId of 1800686860. In my example, that happens to be the paragraph containing the header information and first line ("From: Joe Bloggs [...] This is an initial e-mail."). The remaining paragraphs with that divId DON'T receive the styling in websettings.xml.

Because it's the styling for a divId of 1800686860 in websettings.xml that causes the blue quoteline to appear on the left, the remaining paragraphs that we want to receive the quoteline don't receive it, because the styling isn't applied to any of the remaining paragraphs! In my opinion this is a nasty bug in MS Word - that it allows itself to generate XML like this that causes a broken HTML transform.
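To check whether a given document.xml has this shape, one option is to count how often each divId value is referenced. The following is only a minimal sketch: the function names and the SAMPLE string are made up for illustration, and it assumes you have already extracted word/document.xml from the DOCX package as a string.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML main namespace (the "w:" prefix in the snippets above)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_div_ids(document_xml: str) -> Counter:
    """Count how many times each w:divId value is referenced."""
    root = ET.fromstring(document_xml)
    counts = Counter()
    for div_id in root.iter(f"{{{W_NS}}}divId"):
        counts[div_id.get(f"{{{W_NS}}}val")] += 1
    return counts


def duplicated_div_ids(document_xml: str) -> list[str]:
    """Return the divId values referenced by more than one paragraph."""
    return [val for val, n in count_div_ids(document_xml).items() if n > 1]


# Hypothetical, heavily trimmed stand-in for word/document.xml
SAMPLE = f"""<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr><w:r><w:t>Let's do some inline quoting when replying to it.</w:t></w:r></w:p>
<w:p><w:r><w:t>(inline reply, no divId)</w:t></w:r></w:p>
<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr><w:r><w:t>Best regards,</w:t></w:r></w:p>
</w:body></w:document>"""

print(duplicated_div_ids(SAMPLE))
```

Any value this reports is a divId shared by several paragraphs, which is the situation described above.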

The fix? Find all divIds that are used by more than one paragraph in document.xml and make a note of them. Then, for each such divId, create one copy of its w:div element in websettings.xml per duplicate reference in document.xml, and give each copy a new, unique ID. Then, change each duplicate divId in document.xml to point at one of the copies. Once those changes are made (so each paragraph is genuinely linked to a separate, unique w:div in websettings.xml), save the modified DOCX as an HTML file in Word... and it works! The generated HTML file looks pretty much identical to the DOCX, blue quotelines included.
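I did this by hand, but the same steps could be scripted. The sketch below is one possible way to do it in Python, not a tested tool: split_shared_div_ids is a hypothetical name, the DOC and WEB strings are trimmed stand-ins for the real parts, and unzipping or re-zipping the DOCX package is left out. One assumption beyond the manual fix: nested w:div elements inside each copy are also renumbered, so that no ID appears twice in websettings.xml.

```python
from itertools import count
from xml.dom import minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def split_shared_div_ids(document_xml, websettings_xml):
    """Give every repeated w:divId reference its own copy of the w:div."""
    doc = minidom.parseString(document_xml)
    web = minidom.parseString(websettings_xml)

    # Index every w:div (nested ones included) by its w:id
    divs = {d.getAttributeNS(W_NS, "id"): d
            for d in web.getElementsByTagNameNS(W_NS, "div")}
    refs = doc.getElementsByTagNameNS(W_NS, "divId")

    # New IDs start above every ID already used in either part
    taken = {int(i) for i in divs}
    taken |= {int(r.getAttributeNS(W_NS, "val")) for r in refs}
    fresh = count(max(taken, default=0) + 1)

    seen = set()
    for ref in refs:
        old = ref.getAttributeNS(W_NS, "val")
        if old not in seen or old not in divs:
            seen.add(old)  # first reference keeps the original w:div
            continue
        clone = divs[old].cloneNode(True)
        # Assumption: renumber nested divs too, to keep all IDs unique
        for node in [clone, *clone.getElementsByTagNameNS(W_NS, "div")]:
            node.setAttributeNS(W_NS, "w:id", str(next(fresh)))
        divs[old].parentNode.insertBefore(clone, divs[old].nextSibling)
        ref.setAttributeNS(W_NS, "w:val", clone.getAttributeNS(W_NS, "id"))
    return doc.toxml(), web.toxml()


# Hypothetical, heavily trimmed stand-ins for the two parts
DOC = (f'<w:document xmlns:w="{W_NS}"><w:body>'
       '<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr></w:p>'
       '<w:p><w:pPr><w:divId w:val="1800686860"/></w:pPr></w:p>'
       '</w:body></w:document>')
WEB = (f'<w:webSettings xmlns:w="{W_NS}"><w:divs>'
       '<w:div w:id="1800686860"><w:divsChild>'
       '<w:div w:id="1800686861"/>'
       '</w:divsChild></w:div>'
       '</w:divs></w:webSettings>')

new_doc, new_web = split_shared_div_ids(DOC, WEB)
print(new_doc)
print(new_web)
```

minidom is used here because it writes element prefixes and namespace declarations back out as they were read, which matters when the result has to be opened by Word again. Work on a copy of the DOCX, and check the result in Word before relying on it.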