How to output a Unicode string as RTF (using C#)

Posted on 2024-08-03 05:03:45

I'm trying to output a Unicode string in RTF format (using C# and WinForms).

From Wikipedia:

If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.

I don't know how to convert a Unicode character into its Unicode code point (as in "\u1576").
Converting to UTF-8, UTF-16 and similar encodings is easy, but I don't know how to get the code point.

The scenario in which I use this:

  • I read an existing RTF file into a string (the file is a template)
  • string.Replace #TOKEN# with MyUnicodeString (the template is populated with data)
  • write the result into another RTF file.

The problem arises when Unicode characters appear in the data.
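
The three steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `RtfTemplate`, `EscapeForRtf`, `Fill` and `FillFile` are hypothetical names, the token and file paths are whatever the caller supplies, and the template is assumed to be 7-bit ASCII (as RTF files usually are). Each UTF-16 code unit outside ASCII is written as `\uN?`, with N as a signed 16-bit decimal.

```csharp
using System.Globalization;
using System.IO;
using System.Text;

static class RtfTemplate
{
    // Escapes plain text so it can be placed in RTF body text.
    public static string EscapeForRtf(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                sb.Append('\\').Append(c);   // RTF syntax characters
            }
            else if (c <= 0x7f)
            {
                sb.Append(c);                // other ASCII passes through
            }
            else
            {
                // \uN takes a signed 16-bit decimal, so reinterpret the
                // UTF-16 code unit as a short; '?' is the fallback character.
                short n = unchecked((short)c);
                sb.Append("\\u")
                  .Append(n.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }

    // Replaces every occurrence of the token with the escaped value.
    public static string Fill(string templateRtf, string token, string value)
    {
        return templateRtf.Replace(token, EscapeForRtf(value));
    }

    // Reads the template, fills it, and writes the result to a new file.
    // Assumption: the template is 7-bit ASCII.
    public static void FillFile(string templatePath, string outputPath,
                                string token, string value)
    {
        string rtf = File.ReadAllText(templatePath, Encoding.ASCII);
        File.WriteAllText(outputPath, Fill(rtf, token, value), Encoding.ASCII);
    }
}
```

Escaping the value before the replacement matters: a backslash or brace in the data would otherwise be read as RTF syntax. Line breaks in the value are not converted to RTF paragraph marks in this sketch.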

Comments (4)

风蛊 2024-08-10 05:03:45

Provided that all the characters that you're catering for exist in the Basic Multilingual Plane (it's unlikely that you'll need anything more), then a simple UTF-16 encoding should suffice.

Wikipedia:

All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.

The following sample program illustrates doing something along the lines of what you want:

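using System;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        // ë (U+00EB), decoded from its UTF-16LE bytes
        char[] ca = Encoding.Unicode.GetChars(new byte[] { 0xeb, 0x00 });
        // using disposes (and flushes) the writer even if WriteLine throws
        using (var sw = new StreamWriter(@"c:/helloworld.rtf"))
        {
            sw.WriteLine(@"{\rtf
{\fonttbl {\f0 Times New Roman;}}
\f0\fs60 H" + GetRtfUnicodeEscapedString(new String(ca)) + @"llo, World!
}");
        }
    }

    static string GetRtfUnicodeEscapedString(string s)
    {
        var sb = new StringBuilder();
        foreach (var c in s)
        {
            if (c <= 0x7f)
                sb.Append(c);   // ASCII passes through unchanged
            else
                // decimal value of the UTF-16 code unit, with ? as the fallback character
                sb.Append("\\u" + Convert.ToUInt32(c) + "?");
        }
        return sb.ToString();
    }
}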

The important bit is Convert.ToUInt32(c), which returns the numeric value of the char. A C# char is a UTF-16 code unit, so for characters in the Basic Multilingual Plane this value is the code point. The RTF escape for Unicode requires that value in decimal. The System.Text.Encoding.Unicode encoding corresponds to UTF-16, as per the MSDN documentation.
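
To make the relationship between a char and a code point concrete, here is a small sketch that lists the code points of a string. `CodePoints` and `FromString` are hypothetical names; `char.IsSurrogatePair` and `char.ConvertToUtf32` are the standard .NET methods. Lone surrogates, if present, are returned as their raw code unit value.

```csharp
using System.Collections.Generic;

static class CodePoints
{
    // Returns the Unicode code points of s, combining surrogate pairs.
    public static List<int> FromString(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                // Outside the BMP: two chars form one code point.
                result.Add(char.ConvertToUtf32(s[i], s[i + 1]));
                i++;   // skip the low surrogate
            }
            else
            {
                // Inside the BMP: the char value is the code point.
                result.Add(s[i]);
            }
        }
        return result;
    }
}
```

For the Arabic letter beh (U+0628) this yields 1576, the number in the `\u1576?` example. For U+1F600 it yields 128512, which does not fit the 16-bit parameter of `\u`, which is why the BMP assumption matters.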

长亭外,古道边 2024-08-10 05:03:45

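Fixed code from the accepted answer: added escaping of the special characters \, { and }, as described in this link

static string GetRtfUnicodeEscapedString(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        // backslash and braces are RTF syntax, so prefix them with a backslash
        if(c == '\\' || c == '{' || c == '}')
            sb.Append(@"\" + c);
        else if (c <= 0x7f)
            sb.Append(c);   // other ASCII passes through unchanged
        else
            // decimal value of the UTF-16 code unit, with ? as the fallback character
            sb.Append("\\u" + Convert.ToUInt32(c) + "?");
    }
    return sb.ToString();
}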
深海夜未眠 2024-08-10 05:03:45

You will have to convert the string to a byte[] array (using Encoding.Unicode.GetBytes(string)), then loop through that array and prepend a \ and u character to all Unicode characters you find. When you then convert the array back to a string, you'd have to leave the Unicode characters as numbers.

For example, if your array looks like this:

byte[] unicodeData = new byte[] { 0x15, 0x76 };

it would become:

// 5c = \, 75 = u
byte[] unicodeData = new byte[] { 0x5c, 0x75, 0x15, 0x76 };
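
A byte-level approach along these lines could be sketched as below, under a few assumptions: `RtfFromBytes` and `Escape` are hypothetical names, the bytes come from Encoding.Unicode (UTF-16, little-endian, so the low byte comes first), and the value after `\u` is written as decimal text rather than as raw bytes. For U+0628 the bytes are 0x28, 0x06, which combine to 0x0628, that is 1576 in decimal.

```csharp
using System.Globalization;
using System.Text;

static class RtfFromBytes
{
    public static string Escape(string s)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(s);   // UTF-16, little-endian
        var sb = new StringBuilder();
        for (int i = 0; i < bytes.Length; i += 2)
        {
            // Combine low byte and high byte into one 16-bit code unit.
            int unit = bytes[i] | (bytes[i + 1] << 8);
            if (unit == '\\' || unit == '{' || unit == '}')
            {
                sb.Append('\\').Append((char)unit);    // RTF syntax characters
            }
            else if (unit <= 0x7f)
            {
                sb.Append((char)unit);                 // other ASCII passes through
            }
            else
            {
                // Signed 16-bit decimal, followed by the fallback character.
                short n = unchecked((short)unit);
                sb.Append("\\u")
                  .Append(n.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }
}
```

Iterating over the chars of the string directly gives the same 16-bit values without the byte array, so this form is mainly useful when the data already arrives as UTF-16 bytes.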
小猫一只 2024-08-10 05:03:45

Based on the specification, here is some Java code that is tested and works:

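  public static String escape(String s){
        if (s == null) return s;

        int len = s.length();
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++){
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x80){
                // printable ASCII; backslash and braces need a backslash prefix
                if (c == '\\' || c == '{' || c == '}'){
                    sb.append('\\');
                }
                sb.append(c);
            }
            else if (c < 0x20 || (c >= 0x80 && c <= 0xFF)){
                // \'hh hex escape: literal backslash, apostrophe, two hex digits
                sb.append("\\'");
                sb.append(String.format("%02x", (int) c));
            }else{
                sb.append("\\u");
                sb.append((short)c); // signed 16-bit: values above 32767 come out negative
                sb.append("??"); // two fallback characters; assumes \uc2 is in effect (the default \uc1 skips only one)
            }
        }
        return sb.toString();
 }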

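The important thing is that the escaped Unicode value must be followed by fallback characters (something close to the Unicode character, or simply ?). I append two because the Unicode character occupies 2 bytes; per the spec quoted below, the number of characters a reader skips is set by the last \ucN value, so the count of fallback characters has to match it.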

Also, the spec says you should use a negative value if the code point is greater than 32767, but in my tests it also works if you don't use a negative value.

Here is the spec:

\uN This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.
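
As an illustration of the \ucN and signed-value rules quoted above, here is a minimal C# sketch. `RtfUnicode` and `Escape` are hypothetical names, and the fallback is assumed to consist only of ASCII characters that need no RTF escaping, so that its length equals the number of characters a reader should skip. A keyword-terminating space is written after the number so that a fallback starting with a digit is not read as part of N.

```csharp
using System.Globalization;
using System.Text;

static class RtfUnicode
{
    // Builds \ucN\uN, a delimiter space, and then the fallback text.
    // Assumption: fallback is plain ASCII with no '\', '{' or '}'.
    public static string Escape(char c, string fallback)
    {
        // Reinterpret the 16-bit value as signed: above 32767 it becomes
        // value - 65536, i.e. a negative number.
        short n = unchecked((short)c);

        var sb = new StringBuilder();
        sb.Append("\\uc")
          .Append(fallback.Length.ToString(CultureInfo.InvariantCulture));
        sb.Append("\\u")
          .Append(n.ToString(CultureInfo.InvariantCulture));
        sb.Append(' ');          // terminates the keyword; not counted as skippable
        sb.Append(fallback);
        return sb.ToString();
    }
}
```

For example, U+FF21 has the value 65313, which is greater than 32767, so it is written as 65313 - 65536 = -223. A horizontal ellipsis (U+2026) with the three-character fallback "..." is preceded by \uc3 so that a Unicode-aware reader skips all three dots.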
