How to discover which code page to use when converting RTF hex literals to Unicode
I'm parsing RTF 1.5+ files generated by Word 2003+ that may contain content in other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to Unicode values.
I know my document's code page by looking for ansicpg (\ansi\ansicpg1252).
When I use the ansicpg code page to decode to Unicode, many languages (like French) seem to convert to the Unicode character values that I expect.
However, when I see Russian text (like below), code page 1252 decodes the content to gibberish.
\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286
\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2
\'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0
\'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5
\'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.
I assume that lang1049, langfe1033 and langnp1049 should give me clues so that I can programmatically choose a different (non-default) code page for the text they apply to? If so, where can I find information that explains how to map a lang* code to a code page? Or should I be looking for some other RTF command/directive to give me the information I need? (Or must I use \f277 as a font reference and see if it has an associated code page?)
Comments (2)
\lang really only marks up particular stretches of the text as being in a particular language, and shouldn't affect which code page is used for the old non-Unicode \' escapes.

Putting an \ansicpg token in the header should perhaps do it, but it seems to be ignored by Word (for both raw bytes and \' escapes).

As for using \f277 as a font reference: it looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes in this token (see e.g. here for a list) are, aggravatingly, different again from both the language ID and the code page number.

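To make the \fcharset route concrete, here is a minimal Python sketch. It is not a full RTF parser. The font table string is hypothetical: the question does not show how \f277 is declared, so the sketch assumes it carries \fcharset204 (the Russian charset, which corresponds to code page 1251). The mapping table is partial, and falling back to cp1252 stands in for the document's \ansicpg value.

```python
import re

# Partial \fcharsetN -> Python codec map; extend as needed.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI (Western)
    128: "cp932",   # Japanese (Shift-JIS)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    177: "cp1255",  # Hebrew
    178: "cp1256",  # Arabic
    186: "cp1257",  # Baltic
    204: "cp1251",  # Russian (Cyrillic)
    238: "cp1250",  # Eastern European
}

# Matches the start of a font entry such as {\f277\froman\fcharset204 ...
FONT_ENTRY = re.compile(r"\{\\f(\d+)[^{}]*?\\fcharset(\d+)")
# Matches a run of consecutive \'xx escapes.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def font_codecs(font_table, default="cp1252"):
    """Map each font number in a \\fonttbl group to a codec name."""
    return {
        int(font): FCHARSET_TO_CODEC.get(int(charset), default)
        for font, charset in FONT_ENTRY.findall(font_table)
    }


def decode_hex_escapes(run_text, codec):
    """Replace each run of \\'xx escapes with its decoded text."""
    def repl(match):
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codec, errors="replace")
    return HEX_RUN.sub(repl, run_text)


# Hypothetical font table; the real declaration of \f277 is an assumption.
FONT_TABLE = (
    r"{\fonttbl{\f0\froman\fcharset0 Times New Roman;}"
    r"{\f277\froman\fcharset204 Times New Roman Cyr;}}"
)
# First sentence of the Russian sample from the question.
SAMPLE = r"\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff."

codecs_by_font = font_codecs(FONT_TABLE)
decoded = decode_hex_escapes(SAMPLE, codecs_by_font[277])
print(decoded)  # Страницы без названия.
```

Decoding whole runs of escapes rather than single bytes keeps the sketch usable for double-byte charsets such as Shift-JIS, where one character spans two \'xx escapes.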
It is not entirely clear, but according to MSDN you can use the RichEdit control to convert the RTF to UTF-8:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb774304(v=vs.85).aspx
Take a look at the SF_USECODEPAGE flag for the EM_STREAMOUT message.
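As a small illustration of how that flag is combined, the sketch below only computes the wParam value for EM_STREAMOUT; it does not create a RichEdit control or send the message. The constant values are the ones defined in the Windows SDK headers as far as I know, and the helper name streamout_wparam is made up for this example. With SF_USECODEPAGE, the code page goes in the high word of wParam.

```python
# Constants as defined in the Windows SDK (richedit.h / winnls.h).
SF_TEXT = 0x0001         # stream out plain text
SF_USECODEPAGE = 0x0020  # use the code page given in the high word of wParam
CP_UTF8 = 65001          # UTF-8 code page identifier


def streamout_wparam(code_page, fmt=SF_TEXT):
    """Build the EM_STREAMOUT wParam: code page in the high word, flags in the low word."""
    return (code_page << 16) | SF_USECODEPAGE | fmt


wparam = streamout_wparam(CP_UTF8)
print(hex(wparam))  # 0xfde90021
```

The RTF would first have to be loaded into the control (for example with EM_STREAMIN) before streaming it back out with this wParam.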