Working with Unicode and ASCII character sets in C#

Posted 2024-10-10 13:30:41

I have this mapping in my C# application:

string[,] unicode2Ascii = { { "ஹ", "\x86" } };

ஹ is the Unicode value for the Tamil letter "ஹ". This is the raw hex literal for the Unicode value, saved by MS Word as a byte sequence. I am trying to map these Unicode value "strings" to a hex value under 255 (so as to accommodate systems that do not support Unicode).

I am trying to use string.Replace like this:

S = S.Replace(unicode2Ascii[0,0], unicode2Ascii[0,1]);

However, the resulting output has a ? instead of the actual hex 0x86 stored. Any pointers on how I could set the encoding for the second element of that array to something like windows-1252?

Or is there a better way to do this conversion?

Thanks in advance.

够钟 2024-10-17 13:30:42

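Strings in .NET are always Unicode internally. That doesn't really matter, though. A string is a sequence of characters, and .NET strings support all Unicode characters. You shouldn't care how they are represented in memory. The only time you care about encoding is when your strings leave (or enter) .NET, i.e. when you write them to (or read them from) a file, send them to (or receive them from) another system over a socket, and so on. That is when you use the Encoding class to convert to whatever encoding you want. Replacing characters or trying any encoding tricks on the .NET strings themselves is pointless.
I also recommend this article: http://www.joelonsoftware.com/articles/Unicode.html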
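
To make the point about encodings concrete, here is a minimal sketch showing that the same .NET string yields different bytes depending on which Encoding is applied when it leaves the process. The class and method names are made up for illustration; the character used is U+0086, the target value from the question's mapping.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class EncodingBoundaryDemo
{
    // Encodes `text` with the given encoding and formats the bytes
    // as hex pairs separated by '-', for example "C2-86".
    public static string HexBytes(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void PrintAll()
    {
        string s = "\u0086"; // the single char U+0086

        // ASCII cannot represent U+0086, so the encoder substitutes '?' (0x3F).
        Console.WriteLine("us-ascii:   " + HexBytes(s, Encoding.ASCII));

        // ISO-8859-1 maps U+0000..U+00FF straight onto the bytes 0x00..0xFF.
        Console.WriteLine("iso-8859-1: " + HexBytes(s, Encoding.GetEncoding("iso-8859-1")));

        // UTF-8 needs two bytes for U+0086.
        Console.WriteLine("utf-8:      " + HexBytes(s, Encoding.UTF8));
    }
}
```

Calling EncodingBoundaryDemo.PrintAll() prints 3F, 86 and C2-86 for the three encodings. A ? in the output, as described in the question, is what one would expect whenever the encoding used for writing has no slot for the character being written.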

遗忘曾经 2024-10-17 13:30:41

Not sure if this helps, but Windows supports the Tamil code page 57004 ("ISCII Tamil").

It does not give the same translation for your example character, though: for 'ஹ' it gives 216 rather than the 0x86 (134) in your mapping. Perhaps a different code page needs to be used?

        string tamilUnicodeString = "ஹ";

        // "x-iscii-ta" is the name of code page 57004 (ISCII Tamil).
        Encoding encoding = Encoding.GetEncoding("x-iscii-ta");

        byte[] codepageBytes = encoding.GetBytes(tamilUnicodeString);
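
On .NET Core and .NET 5 or later, code-page encodings such as ISCII are not available by default and have to be enabled through CodePagesEncodingProvider first; on .NET Framework the snippet above should work as-is. The following is a minimal sketch under the assumption of a .NET 5+ runtime, with a hypothetical helper name, that encodes a string with x-iscii-ta and formats the resulting bytes so they can be compared against the expected values.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; assumes .NET 5 or later.
public static class IsciiDemo
{
    // Encodes `text` with the ISCII Tamil code page and returns the bytes
    // as hex pairs separated by '-'.
    public static string EncodeToHex(string text)
    {
        // Makes the code-page encodings visible to Encoding.GetEncoding.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding iscii = Encoding.GetEncoding("x-iscii-ta"); // code page 57004
        return BitConverter.ToString(iscii.GetBytes(text));
    }
}
```

Printing IsciiDemo.EncodeToHex("ஹ") makes it easy to check whether a given code page produces the byte values the target system expects.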

Update

If you want to take a Unicode file as input and transliterate its characters into a single-byte representation, the following should do the trick. The resulting array holds your single-byte representation, provided your dictionary has an entry for every character that occurs in the input:

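        // Requires: using System.Collections.Generic; using System.Text;
        Dictionary<char, char> lookup = new Dictionary<char, char>
        {
            { 'ஹ', '\x86' },
            { 'இ', '\x87' },
            // next pair...
            // etc.
        };

        string input = "ஹஇதில் உள்ள தமிழ் எழுத்துக்கள் சரியாகத் தெரிந்தால்";

        char[] chars = input.ToCharArray();

        // Replace every character that has an entry in the lookup table.
        for (int i = 0; i < chars.Length; i++)
        {
            char replaceChar;

            if (lookup.TryGetValue(chars[i], out replaceChar))
            {
                chars[i] = replaceChar;
            }
        }

        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto the bytes 0x00..0xFF,
        // so '\x86' becomes the byte 0x86. Tamil characters that have no entry
        // in the table cannot be represented and come out as '?'.
        byte[] output = Encoding.GetEncoding("iso-8859-1").GetBytes(chars);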
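
A variation on the same idea, shown here only as a sketch: if the table stores byte values instead of chars, the final Encoding step disappears and unmapped characters can be handled explicitly. The class name, the fallback parameter and the table contents are hypothetical; this is not the mapping of any real Tamil font or code page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; the table passed in is supplied by the caller.
public static class TamilByteMapper
{
    // Maps each UTF-16 char of `input` to exactly one output byte.
    public static byte[] Map(string input,
                             IReadOnlyDictionary<char, byte> table,
                             byte fallback = (byte)'?')
    {
        var output = new byte[input.Length];

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (table.TryGetValue(c, out byte mapped))
            {
                output[i] = mapped;      // explicit table entry wins
            }
            else if (c < 0x80)
            {
                output[i] = (byte)c;     // plain ASCII passes through unchanged
            }
            else
            {
                output[i] = fallback;    // anything else is marked as unmapped
            }
        }

        return output;
    }
}
```

For example, TamilByteMapper.Map("ஹஇ", new Dictionary<char, byte> { { 'ஹ', 0x86 }, { 'இ', 0x87 } }) returns the two bytes 0x86 and 0x87, which can then be written with File.WriteAllBytes. Note that Tamil vowel signs and the pulli are separate chars in a .NET string, so each of them needs its own table entry.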
