A better way to convert Codepage-1251 in RTF to Unicode

Posted 2024-08-24 19:46:08

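I am trying to parse RTF (via MSEDIT) in various languages, all in Delphi 2010, in order to produce HTML in Unicode.

Taking Russian/Cyrillic as my starting point, I find that the overall document code page is 1252 (Western), but the Russian parts of the text are identified by the charset of the font (RUSSIAN_CHARSET, 204).

So far I:

1) Use AnsiString (or RawByteString) when parsing the RTF.

2) Determine the code page by lookup from the font charset (see http://msdn.microsoft.com/en-us/library/cc194829.aspx).

3) Translate using a lookup table in my code (generated from http://msdn.microsoft.com/en-gb/goglobal/cc305144.aspx). I am going to need one table per supported code page!

There must be a better way than this? Preferably something supplied by the OS, and therefore less brittle than tables of constants.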


西瑶 2024-08-31 19:46:08

The Charset to codepage table is small enough, and static enough, that I doubt the system provides a function to do it.
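The small, static mapping described above can be written as a plain case statement. What follows is only a minimal sketch, not a complete table: the name CharSetToCodePage matches the helper called in the snippet further down, but its body here is an assumption, the charset constants are the ones declared in the Delphi Windows unit, and falling back to CP_ACP for unlisted charsets is an illustrative choice.

```pascal
program CharSetDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: maps a font charset value (for example
// RUSSIAN_CHARSET = 204) to the matching Windows code page.
// Only a subset of charsets is listed; extend as needed.
function CharSetToCodePage(CharSet: Byte): Word;
begin
  case CharSet of
    ANSI_CHARSET:        Result := 1252;
    SHIFTJIS_CHARSET:    Result := 932;
    HANGEUL_CHARSET:     Result := 949;
    GB2312_CHARSET:      Result := 936;
    CHINESEBIG5_CHARSET: Result := 950;
    GREEK_CHARSET:       Result := 1253;
    TURKISH_CHARSET:     Result := 1254;
    HEBREW_CHARSET:      Result := 1255;
    ARABIC_CHARSET:      Result := 1256;
    BALTIC_CHARSET:      Result := 1257;
    RUSSIAN_CHARSET:     Result := 1251;
    THAI_CHARSET:        Result := 874;
    EASTEUROPE_CHARSET:  Result := 1250;
  else
    // Assumption: treat anything unlisted as the system ANSI code page.
    Result := CP_ACP;
  end;
end;

begin
  WriteLn(CharSetToCodePage(RUSSIAN_CHARSET)); // prints 1251
end.
```

Only this one table has to be maintained by hand; the per-code-page character tables are left to the OS.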

To do the actual character translations you can use the SysUtils.TEncoding class or the System.SetCodePage function. Both internally use the Windows API function MultiByteToWideChar, which uses OS-provided lookup tables, so you don't need to maintain them.
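For comparison, the TEncoding route mentioned above might look roughly like the sketch below. DecodeBytes is a hypothetical helper name introduced here for illustration; TEncoding.GetEncoding, GetString and BytesOf come from SysUtils. The encoding instance is freed on the assumption that the caller owns whatever GetEncoding returns.

```pascal
program EncodingDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: decodes raw bytes in the given code page to a
// Unicode string using an OS-backed TEncoding instance.
function DecodeBytes(const Raw: RawByteString; CodePage: Integer): string;
var
  Enc: TEncoding;
begin
  Enc := TEncoding.GetEncoding(CodePage);
  try
    Result := Enc.GetString(BytesOf(Raw));
  finally
    Enc.Free; // the caller owns the instance returned by GetEncoding
  end;
end;

var
  Raw: RawByteString;
  S: string;
begin
  // Two bytes that spell Cyrillic "Да" in Windows-1251.
  SetLength(Raw, 2);
  Raw[1] := AnsiChar($C4);
  Raw[2] := AnsiChar($E0);

  S := DecodeBytes(Raw, 1251);

  // Print the UTF-16 code units rather than the text itself, so the
  // result does not depend on the console's own code page.
  WriteLn(IntToHex(Ord(S[1]), 4), ' ', IntToHex(Ord(S[2]), 4)); // 0414 0430
end.
```

The bytes are set one at a time instead of through a string literal, so the compiler never reinterprets them using the system code page.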

Using SetCodePage would look something like this:

var
  iStart, iStop: Integer;
  RTF, RawText: AnsiString;
  Text: string;
  CodePage: Word;
begin
  ...
  // CharSetToCodePage is your own charset-to-code-page lookup
  CodePage := CharSetToCodePage(CharSet);
  RawText := Copy(RTF, iStart, iStop - iStart);
  // Tag the bytes with the target code page (1251 for Russian) without converting them
  SetCodePage(RawText, CodePage, False);
  // Automatic conversion from the string's code page to Unicode
  Text := string(RawText);
