How to display Japanese RTF fonts correctly
I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.
Typically, English and Japanese are mixed and mostly displayed without a problem, for example:
Inventory turns partnerships. 在庫回転率の
(my apologies if the Japanese text is broken incorrectly - I do not speak or read the language).
Quite frequently, however, only the Japanese portion of the text will be gibberish, for example:
ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優 先順位と彼らに販売する知識)
From extensive online searching, it appears that the problem is caused by the fonts saved as part of the RTF. The fonts present on a Japanese-language version of Windows are not necessarily the same as those on a US English version. It is possible to programmatically replace the fonts in the RTF file, which yields an almost acceptable result, for example:
-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB
However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:
-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?
Clearly, the Unicode characters are rendered correctly, but for example the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.
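One way to test that guess is to treat each `\'xx` escape as a raw byte and decode the whole byte run with the Shift-JIS code page. The sketch below is in Python rather than Delphi, purely for illustration. `decode_hex_escapes` is a hypothetical helper, it only understands `\'xx` escapes and plain ASCII (no control words, no `\uN` tokens), and it assumes the run was written under `\fcharset128`.

```python
import re

# Either an RTF hex escape (\'xx) or any single literal character.
_TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|(.)", re.DOTALL)


def decode_hex_escapes(fragment: str, encoding: str = "cp932") -> str:
    """Rebuild the byte run behind \\'xx escapes and decode it.

    Hypothetical helper: literal characters are assumed to be ASCII and
    are taken as single bytes, which is how a trail byte such as the
    'I' in \\'83I ends up next to its lead byte.
    """
    raw = bytearray()
    for match in _TOKEN.finditer(fragment):
        hex_pair, literal = match.groups()
        if hex_pair is not None:
            raw.append(int(hex_pair, 16))
        else:
            raw.extend(literal.encode("ascii"))
    # Undecodable bytes become U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace")


# Part of the raw RTF sample quoted above.
print(decode_hex_escapes(r"\'83I\'83y\'83\'8c"))
```

Under that assumption the fragment decodes to オペレ, which suggests each pair is one double-byte Shift-JIS character rather than two separate characters.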
Is there a generic, (relatively) foolproof way to take RTF containing Eastern languages and reliably display it again?
For completeness' sake, I updated the RTF font table in the following way:
- Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
- Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
- Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
- Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
- Replaced the font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
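For reference, the escaped name in these replacements is the Shift-JIS (code page 932) byte sequence for ＭＳ Ｐゴシック (MS PGothic), and "?l?r ?o?S?V?b?N" is what is left of those bytes when every byte at or above 0x80 is replaced by a question mark. The Python sketch below reproduces both strings. The helper names are hypothetical, the charset table covers only the three `\fcharset` values mentioned in the list, and RTF special characters (`\`, `{`, `}`) are not handled.

```python
# \fcharset values from the font table edits above, mapped to Python codecs:
# 0 = ANSI, 128 = Shift-JIS, 238 = Eastern European.
FCHARSET_TO_CODEC = {0: "cp1252", 128: "cp932", 238: "cp1250"}

FONT_NAME = "\uff2d\uff33 \uff30\u30b4\u30b7\u30c3\u30af"  # MS PGothic, full-width


def to_hex_escapes(text: str, fcharset: int = 128) -> str:
    """Write text as RTF \\'xx escapes in the code page of the given charset."""
    codec = FCHARSET_TO_CODEC[fcharset]
    out = []
    for ch in text:
        encoded = ch.encode(codec)
        if len(encoded) == 1 and encoded[0] < 0x80:
            out.append(ch)  # plain ASCII stays literal
        else:
            out.extend("\\'%02x" % byte for byte in encoded)
    return "".join(out)


def mask_high_bytes(text: str, fcharset: int = 128) -> str:
    """Show what survives when every byte >= 0x80 is turned into '?'."""
    raw = text.encode(FCHARSET_TO_CODEC[fcharset])
    return "".join("?" if byte >= 0x80 else chr(byte) for byte in raw)


print(to_hex_escapes(FONT_NAME))   # \'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e
print(mask_high_bytes(FONT_NAME))  # ?l?r ?o?S?V?b?N
```

The second output matches the damaged font name in the first bullet, which is consistent with the lead bytes of each double-byte character having been lost somewhere along the way.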
Update: Updating font names alone won't make a difference. The locale seems to be the big problem. I have seen a few sites discussing ways to convert the display of Japanese RTF to something most readers would handle, but I haven't found a solution yet; see for example here and here.
Comments (2)
My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether the RTF you import is encoded as Shift-JIS or Unicode, and also whether the machine you are running on (and therefore the D2009 default input format) is Japanese or not. In Japan, a text file with no Unicode BOM would usually be Shift-JIS (but not always).
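A quick way to check the Shift-JIS theory on a garbled sample, sketched in Python for illustration: re-encode the gibberish with the code page it was presumably mis-read as, then decode those bytes as Shift-JIS (cp932). That the wrong code page is Windows-1252 is an assumption, and `undo_mojibake` is a hypothetical name. This is a diagnostic only: bytes that have no Windows-1252 character (0x81, 0x8D, 0x8F, 0x90 and 0x9D in Python's codec) cannot round-trip, so anything dropped on the way stays lost.

```python
def undo_mojibake(garbled: str, wrong: str = "cp1252", right: str = "cp932") -> str:
    """Reinterpret text that was decoded with the wrong code page.

    Assumption: the original bytes were Shift-JIS (cp932) and were
    displayed as Windows-1252. Characters that cannot be mapped back
    to a byte become '?', and undecodable bytes become U+FFFD.
    """
    raw = garbled.encode(wrong, errors="replace")
    return raw.decode(right, errors="replace")


# Four characters of the kind seen in the garbled sample:
# U+0152 U+00C9 (bytes 8C C9) and U+201A U+00CC (bytes 82 CC).
print(undo_mojibake("\u0152\u00c9\u201a\u00cc"))  # 庫の
```

If the output is readable Japanese, the stored bytes are Shift-JIS and should be decoded with that code page rather than treated as Unicode or Western text.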
I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.
What I did was run the entire RTF text through a conversion routine that swapped all characters above ASCII 127 for their Unicode code point equivalents. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file with no problem. Ironically, it re-encoded the \uc1\uxxx? sequences back to their RTF-escaped equivalents.
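A minimal sketch of such a routine, in Python for illustration (the routine described above was not posted, so this is a reconstruction under assumptions, and `escape_non_ascii` is a hypothetical name). Each character above 127 is written as one `\uN?` token per UTF-16 code unit, with `?` as the one-character fallback that `\uc1` announces. The RTF specification defines N as a signed 16-bit number, so code units above 32767 are written as negative values here.

```python
import struct


def escape_non_ascii(text: str) -> str:
    """Replace every character above ASCII 127 with RTF \\uN? escapes.

    ASCII, including any RTF markup already in the text, is left as is.
    Assumes the document sets \\uc1, so each \\uN is followed by exactly
    one fallback character ('?').
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Characters outside the BMP yield two code units (a surrogate pair).
        for (unit,) in struct.iter_unpack("<H", ch.encode("utf-16-le")):
            signed = unit - 0x10000 if unit > 0x7FFF else unit
            out.append("\\u%d?" % signed)
    return "".join(out)


print(escape_non_ascii("5 \u00b5L"))  # 5 \u181?L
```

Because only characters above 127 are touched, the function can be run over a complete RTF string without disturbing its control words.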
Not sure if that will help your problem, but it's working for me.