Detecting multi-byte and Chinese characters in RTF markup

Posted on 2024-12-18 05:58:10

I'm trying to parse an RTF-formatted message. (I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out.)

Take the RTF that WordPad produces when the string a基bমূcΟιd is pasted straight into it:

{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}

It's difficult to make out if you've not had much to do with RTF, so here's the bit I'm looking at:

\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9

Notice that 基 (U+57FA) is \'8a\'ee. The মূ, which is actually two characters, ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498?, which is fine. But the Οι, which is two separate characters, Ο and ι, is \'cf\'e9.

Is there a way to determine whether I'm looking at something that should be one character, such as 基 = \'8a\'ee, or two characters, such as Ο and ι = \'cf\'e9?

I thought \lang might be the answer, but it isn't: \lang never changes after it is first set. I am already accounting for the different code pages implied by the \fcharset values of the fonts, but that doesn't seem to tell me whether two \'xx escapes next to each other should be treated as one double-byte character or not.

How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?

倾听心声的旋律 2024-12-25 05:58:10

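\'xx escapes represent bytes, which should be interpreted using the encoding given by the \fcharset of the current font. (Or possibly \cchs. If neither is present, fall back to \ansicpg.)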
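A minimal sketch of that lookup, assuming Python and the usual Windows mapping from charset numbers to code pages. The names FCHARSET_TO_CODEC and codec_for_run are made up for illustration, \cchs is not handled, and the table is not exhaustive:

```python
# Windows charset number -> Python codec name (assumed, partial table).
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI
    128: "cp932",   # Japanese (Shift-JIS)
    129: "cp949",   # Korean
    134: "gbk",     # Simplified Chinese (code page 936)
    136: "cp950",   # Traditional Chinese (Big5)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    163: "cp1258",  # Vietnamese
    177: "cp1255",  # Hebrew
    178: "cp1256",  # Arabic
    186: "cp1257",  # Baltic
    204: "cp1251",  # Cyrillic
    222: "cp874",   # Thai
    238: "cp1250",  # Central European
}

def codec_for_run(fcharset: int, ansicpg: int = 1252) -> str:
    """Pick a codec for the \\'xx bytes of a text run.

    fcharset is the value from the run's font table entry; anything not
    in the table (including 1, the "default" charset) falls back to the
    document-wide \\ansicpg value.
    """
    return FCHARSET_TO_CODEC.get(fcharset, f"cp{ansicpg}")

print(codec_for_run(128))  # font f1 in the question
print(codec_for_run(161))  # font f3 in the question
print(codec_for_run(1))    # font f2: no specific charset, use \ansicpg
```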

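You need to know that encoding intimately to determine whether a single \'xx sequence represents a character on its own or is only part of a multi-byte character. Typically you would consume each stretch of text as a unit and then convert that byte string to a Unicode string with whatever library or OS interface is available, which avoids writing a byte-by-byte parser for every code page RTF supports.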
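As a rough illustration of decoding a run as a unit, the sketch below collects every \'xx escape in a run into one byte string and decodes it once. It assumes Python, that the run contains only hex escapes (a real parser would also have to append literal ASCII characters to the same buffer), and that the codec name has already been worked out from the font. decode_hex_run is a hypothetical helper name:

```python
import re

# Matches one RTF hex escape such as \'8a and captures the two hex digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")

def decode_hex_run(run: str, codec: str) -> str:
    # Gather all escaped bytes first, then let the codec decide where
    # the character boundaries are.
    raw = bytes(int(digits, 16) for digits in HEX_ESCAPE.findall(run))
    return raw.decode(codec)

japanese = decode_hex_run(r"\'8a\'ee", "cp932")   # run in font f1
greek = decode_hex_run(r"\'cf\'e9", "cp1253")     # run in font f3

print(japanese, len(japanese))  # one character from two bytes
print(greek, len(greek))        # two characters from two bytes
```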

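\uxxxx? escapes represent UTF-16 code units, with the number written in decimal (\u2478 is U+09AE). That is much simpler, but Word[pad] only produces this form as a last resort, because it is not compatible with earlier RTF versions. (The ? is the fallback character for receivers that cannot handle Unicode.)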
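A small sketch of decoding such escapes, assuming Python and a fragment that consists only of \uN escapes, each followed by its fallback. RTF writes the number as a signed 16-bit decimal, so code units above 32767 appear as negative values; the number of fallback characters to skip comes from \ucN (\uc1 in the question). decode_u_escapes is a hypothetical name:

```python
import re

# \u followed by a signed decimal number and an optional delimiter space.
U_ESCAPE = re.compile(r"\\u(-?\d+) ?")

def decode_u_escapes(fragment: str, skip: int = 1) -> str:
    units = []
    pos = 0
    while True:
        match = U_ESCAPE.match(fragment, pos)
        if match is None:
            break
        # Map negative values back into the 0..65535 code unit range.
        units.append(int(match.group(1)) & 0xFFFF)
        pos = match.end()
        # Skip the fallback: either a \'xx escape or one literal character.
        for _ in range(skip):
            if fragment.startswith("\\'", pos):
                pos += 4
            elif pos < len(fragment):
                pos += 1
    # Decode as UTF-16 so that surrogate pairs combine into one character.
    raw = b"".join(unit.to_bytes(2, "little") for unit in units)
    return raw.decode("utf-16-le")

bengali = decode_u_escapes(r"\u2478?\u2498?")
print(bengali, [hex(ord(ch)) for ch in bengali])
```

Decoding through UTF-16 rather than calling chr() on each number matters only for characters outside the Basic Multilingual Plane, which arrive as two consecutive \uN escapes.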

So:

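The two characters Οι are represented as two byte escapes because the font associated with that stretch of text uses a Greek single-byte encoding (charset 161 = cp1253).
The one character 基 is represented as two byte escapes because the font associated with that stretch of text uses a Japanese multi-byte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the lead byte \'8a signals that another byte follows, as do various other bytes (but not all) in the top-bit-set range.
The two characters মূ are represented as Unicode code unit escapes because there is no other choice: no RTF-compatible code page contains the Bengali characters. (Code page 57003 for ISCII came much later.)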
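To see the first two cases side by side without hand-writing lead-byte tables, one option is to feed the bytes one at a time to an incremental decoder and note when it emits a character. This is only a sketch, assuming Python's codecs module; group_bytes_by_character is a hypothetical helper name:

```python
import codecs

def group_bytes_by_character(data: bytes, codec: str):
    """Return (bytes, text) pairs showing which bytes formed each character."""
    decoder = codecs.getincrementaldecoder(codec)()
    groups = []
    pending = bytearray()
    for byte in data:
        pending.append(byte)
        # A multi-byte codec returns "" for a lead byte and waits for more.
        text = decoder.decode(bytes([byte]))
        if text:
            groups.append((bytes(pending), text))
            pending.clear()
    # Raises UnicodeDecodeError if the data ended in the middle of a character.
    decoder.decode(b"", final=True)
    return groups

# Same byte count, different character count, depending on the font's charset.
print(group_bytes_by_character(b"\x8a\xee", "cp932"))
print(group_bytes_by_character(b"\xcf\xe9", "cp1253"))
```

In cp932 the byte 0x8a produces no output until 0xee arrives, while in cp1253 each byte yields a character immediately, which is the distinction described above.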