Detecting multi-byte and Chinese characters in RTF markup

Posted on 2024-12-18 05:58:10

I'm trying to parse an RTF-formatted message. (I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out.)

Take the RTF code for the string a基bমূcΟιd, pasted straight into WordPad:

{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}

It's difficult to make out if you've not had much to do with RTF, so here's the bit I'm looking at:

\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9

Notice that 基 (U+57FA) is \'8a\'ee, while মূ, which is actually two characters, ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498?, which is fine. But Οι, which is two separate characters Ο and ι, is \'cf\'e9.

Is there a way to determine whether I'm looking at something that should be one character, such as 基 = \'bb\'f9, or two characters, such as Ο and ι = \'cf\'e9?

I was thinking that maybe \lang was it, but that isn't the case at all, because \lang does not change from when it's first set. I am already accounting for the different code pages implied by the different charset values in the fonts, but that doesn't seem to tell me whether I should treat two adjacent escapes as one double-byte character or as two characters.

How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?

Comments (2)

倾听心声的旋律 2024-12-25 05:58:10

\'xx escapes represent bytes, and should be interpreted using the encoding implied by the current font's \fcharset (or potentially \cchs), falling back to \ansicpg if neither is present.

You need to know that encoding intimately to decide whether a single \'xx sequence represents a character on its own or is only part of a multi-byte character. Typically you would instead consume each stretch of text as a unit and then convert that byte string into a Unicode string using whatever library or OS interface you have available, which avoids writing a byte-by-byte parser for every code page RTF supports.
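As a rough illustration of "consume the run, then convert", here is a minimal Python sketch. The names FCHARSET_TO_CODEC, decode_hex_run and decode_hex_runs are made up for this example, and the charset table is deliberately partial; it assumes Python's built-in codecs are an acceptable stand-in for the Windows code pages.

```python
import re

# Partial, illustrative map from RTF \fcharsetN values to Python codec names.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI (Western)
    128: "cp932",   # Japanese, roughly Shift-JIS
    134: "gbk",     # Simplified Chinese (code page 936)
    161: "cp1253",  # Greek
    204: "cp1251",  # Cyrillic
}

# One or more consecutive \'xx escapes.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_hex_run(run, fcharset, ansicpg=1252):
    """Decode a whole run of \\'xx escapes in one go."""
    # Fall back to the document's \ansicpg when the charset is not in the table.
    codec = FCHARSET_TO_CODEC.get(fcharset, f"cp{ansicpg}")
    raw = bytes.fromhex(run.replace("\\'", ""))
    # The codec, not the parser, decides which bytes pair up.
    return raw.decode(codec, errors="replace")


def decode_hex_runs(text, fcharset, ansicpg=1252):
    """Replace every run of \\'xx escapes in text with its decoded form."""
    return HEX_RUN.sub(lambda m: decode_hex_run(m.group(0), fcharset, ansicpg), text)


if __name__ == "__main__":
    print(decode_hex_run(r"\'8a\'ee", 128))  # one character under cp932
    print(decode_hex_run(r"\'cf\'e9", 161))  # two characters under cp1253
```

The parser never has to ask whether \'8a is a lead byte; it only has to know which font (and therefore which charset) is active for the run.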

\uxxxx? escapes represent UTF-16 code units. This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (? is the fallback character for when the receiver can't cope with the Unicode.)
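A small Python sketch of turning \uN escapes back into text follows. The helper name decode_u_run is hypothetical, and the sketch assumes the run contains only \uN escapes, each followed by a single fallback character (as with \uc1 in the sample). Because N is written as a signed 16-bit decimal, values above 32767 appear as negative numbers, so the code masks them back into the 0..65535 range before decoding.

```python
import re
import struct

# Captures the (possibly negative) decimal parameter of each \uN escape.
U_ESCAPE = re.compile(r"\\u(-?\d+)")


def decode_u_run(run):
    """Decode a run of \\uN escapes as a sequence of UTF-16 code units."""
    # Map signed 16-bit values back to unsigned code units.
    units = [int(n) & 0xFFFF for n in U_ESCAPE.findall(run)]
    # Pack as little-endian 16-bit units and let the codec join surrogate pairs.
    raw = struct.pack(f"<{len(units)}H", *units)
    return raw.decode("utf-16-le")


if __name__ == "__main__":
    print(decode_u_run(r"\u2478?\u2498?"))      # two Bengali code points
    print(decode_u_run(r"\u-10179?\u-8704?"))   # a surrogate pair, one character
```

Decoding the whole run at once, rather than one escape at a time, is what lets a surrogate pair come out as a single character.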

So:

  • The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).

  • The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).

  • The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)
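The three cases above can be traced through the sample fragment with a toy Python walker. This is only a sketch, not a real RTF parser: FONT_CHARSET is copied by hand from the sample's \fonttbl, CHARSET_CODEC and plain_text are names invented here, groups and braces are ignored, and \uN is handled for the Basic Multilingual Plane only.

```python
import re

# Font number -> \fcharset, transcribed from the sample's \fonttbl.
FONT_CHARSET = {0: 0, 1: 128, 2: 1, 3: 161}
# Partial charset -> Python codec map (illustrative, not exhaustive).
CHARSET_CODEC = {0: "cp1252", 128: "cp932", 161: "cp1253"}

TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"      # group 1: hex byte escape
    r"|\\([a-z]+)(-?\d+)? ?"    # groups 2, 3: control word and optional parameter
    r"|([^\\{}])"               # group 4: plain character
)


def plain_text(fragment, ansicpg=1252, uc=1):
    out = []
    pending = bytearray()       # bytes collected from adjacent \'xx escapes
    codec = f"cp{ansicpg}"
    skip = 0                    # fallback characters still to drop after \uN

    def flush():
        # Decode the collected bytes as one unit under the current codec.
        if pending:
            out.append(bytes(pending).decode(codec, errors="replace"))
            pending.clear()

    for m in TOKEN.finditer(fragment):
        hex_byte, word, param, char = m.groups()
        if skip and (hex_byte or char):
            skip -= 1           # this is the fallback for a preceding \uN
            continue
        if hex_byte:
            pending.append(int(hex_byte, 16))
            continue
        flush()                 # anything else ends the current byte run
        if char:
            out.append(char)
        elif word == "f" and param is not None:
            charset = FONT_CHARSET.get(int(param))
            codec = CHARSET_CODEC.get(charset, f"cp{ansicpg}")
        elif word == "u" and param is not None:
            out.append(chr(int(param) & 0xFFFF))  # BMP only in this sketch
            skip = uc
    flush()
    return "".join(out)


if __name__ == "__main__":
    sample = (r"a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?"
              r"\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par")
    print(plain_text(sample))
```

The byte run is flushed before each font switch is applied, so \'8a\'ee is decoded under cp932 and \'cf\'e9 under cp1253, which is the whole difference between one character and two.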

三岁铭 2024-12-25 05:58:10

RTF has tags for specifying the code page/encoding used to encode Unicode characters. The actual hex codes for the characters are the bytes used by the specified encoding. In this case, that is \ansicpg1252, for ANSI code page 1252.
