Detecting multi-byte and Chinese characters in RTF markup
I'm trying to parse an RTF formatted message (I need to keep the formatting tags, so I can't use the trick where you just paste it into a RichTextBox and get the .PlainText out).
Take the RTF code for the string a基bমূcΟιd pasted straight into Wordpad:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}
It's difficult to make out if you haven't had much to do with RTF, so here's the bit I'm looking at:
\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9
Notice that 基 (U+57FA) is \'8a\'ee, while মূ, which is actually two characters, ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498?, which is fine. But Οι, which is two separate characters, Ο and ι, is \'cf\'e9.
Is there a way to determine whether I'm looking at something that should be one character, such as 基 = \'8a\'ee, or two characters, such as Ο and ι = \'cf\'e9?
I was thinking that maybe \lang was the answer, but that isn't the case at all, because \lang does not change from when it's first set. I am already accounting for the different code pages implied by the different charset values in the fonts, but that doesn't seem to tell me whether I should treat two escapes next to each other as one double-byte character or not.
How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single-byte?
Comments (2)
\'xx escapes represent bytes and should be interpreted using the \fcharset encoding. (Or potentially \cchs. Falling back to \ansicpg if not present.)

You need to know that encoding intimately to be able to decide whether a single \'xx sequence represents a character on its own or is only part of a multi-byte character. Typically you will consume each section of text as a unit and then convert that byte string into a Unicode string using whatever library or OS interface you have available, which avoids having to write byte-by-byte parsers for every code page supported by RTF.

\uxxxx? escapes represent UTF-16 code units. This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (The ? is the fallback character for when the receiver can't cope with the Unicode.)

So:

The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).

The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).

The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)

RTF has tags for specifying the codepage/encoding used to encode Unicode characters. The actual hex codes for the characters are the byte octets used by the specified encoding. In this case, \ansicpg1252 for ANSI codepage 1252.
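To make the first answer's advice concrete, here is a minimal Python sketch of "collect the \'xx bytes of a run, then decode them with the code page of the current font". It is only an illustration under assumptions, not a complete RTF parser: the names `CHARSET_TO_CODEC`, `TOKEN` and `decode_run` are made up for this example, the charset table is a small hand-picked subset, the font table is passed in by hand rather than parsed from \fonttbl, and groups ({ }) are ignored.

```python
import re

# Assumed subset of the \fcharsetN -> Python codec mapping (illustrative only).
CHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI
    128: "cp932",   # Japanese (Shift-JIS superset)
    134: "gbk",     # Simplified Chinese
    161: "cp1253",  # Greek
}

TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"      # \'xx  hex-escaped byte
    r"|\\u(-?\d+) ?"            # \uN   UTF-16 code unit (signed 16-bit value)
    r"|\\([a-z]+)(-?\d+)? ?"    # any other control word, optional parameter
    r"|([^\\{}\r\n])"           # literal character
)


def decode_run(rtf_text, font_charsets, default_codec="cp1252", uc=1):
    """Decode the text of an RTF fragment (hypothetical helper).

    font_charsets maps font number -> \\fcharset value, as read from \\fonttbl.
    """
    out = []                # decoded pieces of text
    pending = bytearray()   # \'xx bytes waiting to be decoded together
    codec = default_codec
    skip = 0                # fallback tokens still to ignore after a \uN

    def flush():
        # Let the codec decide which bytes are lead bytes and which stand alone.
        if pending:
            out.append(pending.decode(codec, errors="replace"))
            pending.clear()

    for m in TOKEN.finditer(rtf_text):
        hex_byte, unicode_val, word, param, literal = m.groups()
        if skip and unicode_val is None:
            skip -= 1       # this token is the fallback for the previous \uN
            continue
        if hex_byte:
            pending.append(int(hex_byte, 16))
        elif unicode_val is not None:
            flush()
            out.append(chr(int(unicode_val) & 0xFFFF))  # negative -> unsigned
            skip = uc
        elif word:
            flush()         # decode with the old codec before any font change
            if word == "f" and param is not None:
                charset = font_charsets.get(int(param))
                codec = CHARSET_TO_CODEC.get(charset, default_codec)
            elif word == "uc" and param is not None:
                uc = int(param)
        else:
            flush()
            out.append(literal)
    flush()
    # Join surrogate pairs that arrived as two separate \uN escapes.
    text = "".join(out)
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "replace")


# Font numbers and charsets copied by hand from the \fonttbl in the question.
FONTS = {0: 0, 1: 128, 2: 1, 3: 161}
SAMPLE = (r"a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?"
          r"\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d")

print(decode_run(SAMPLE, FONTS))
```

With the fragment from the question this should give back a基bমূcΟιd: the same pair of escapes becomes one character under cp932 and two under cp1253 purely because of the codec chosen from the font, so the parser itself never needs to know which bytes are lead bytes.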