Manipulating Unicode and ASCII character sets in C#
I have this mapping in my C# application:
string [,] unicode2Ascii = {
{ "ஹ", "\x86" }
};
"ஹ" is the Unicode value for the Tamil letter "ஹ". This is the raw hex literal for the Unicode value, saved by MS Word as a byte sequence. I am trying to map these Unicode value "strings" to a hex value under 255 (so as to accommodate systems without Unicode support).
I am trying to use string.Replace like this:
S = S.Replace(unicode2Ascii[0,0], unicode2Ascii[0,1]);
However, the resulting output has a ? instead of the actual hex 0x86 that is stored. Any pointers on how I could set the encoding for the second element of that array to something like windows-1252?
Or is there a better way to do this conversion?
Thanks in advance.
Comments (2)
Strings in .NET are always Unicode internally. However, this does not really matter. Strings are a series of characters, and .NET strings support all Unicode characters. You should not care how they are represented in memory. You care about encoding only when your strings leave (or enter) .NET, i.e. when you write them to (or read them from) files, send them to (or receive them from) other systems over sockets, etc. This is when you use the Encoding class to convert to whatever encoding you desire. Replacing characters or trying any encoding tricks on .NET strings is pointless.
Also, I recommend this article: http://www.joelonsoftware.com/articles/Unicode.html
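To illustrate the point about encoding only at the boundary, here is a minimal sketch. The class and method names are made up for illustration and are not from the question. The text stays an ordinary .NET string, and an encoding is chosen only when it is converted to bytes. A character that the target encoding cannot represent is replaced with `?`, which may be the same symptom the question describes.

```csharp
using System;
using System.Text;

public static class EncodingBoundaryDemo
{
    // Encode only at the point where the text leaves .NET.
    public static byte[] ToUtf8(string text) => Encoding.UTF8.GetBytes(text);

    // ASCII cannot represent Tamil letters, so they become '?' (0x3F).
    public static byte[] ToAscii(string text) => Encoding.ASCII.GetBytes(text);

    public static void Main()
    {
        string text = "\u0BB9"; // TAMIL LETTER HA

        Console.WriteLine(BitConverter.ToString(ToUtf8(text)));  // E0-AE-B9
        Console.WriteLine(BitConverter.ToString(ToAscii(text))); // 3F
    }
}
```

The same pattern applies to files and sockets: pass the desired `Encoding` to the writer or call `GetBytes` just before sending, rather than altering the string itself.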
Not sure if this helps, but the Tamil codepage "57004 - ISCII Tamil" is supported by Windows.
It does not give the same translation for the example character above though. For 'ஹ' it gives 216. Perhaps a different codepage needs to be used?
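The lookup described above could be reproduced with something like the following sketch. It assumes .NET Core 3.0 or later (or .NET 5+), where code page encodings such as 57004 have to be registered through `CodePagesEncodingProvider` first; on the classic .NET Framework the registration line should not be needed. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class IsciiTamilDemo
{
    public static byte[] EncodeIsciiTamil(string text)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // 57004 is the "ISCII Tamil" code page mentioned in the answer.
        Encoding iscii = Encoding.GetEncoding(57004);
        return iscii.GetBytes(text);
    }

    public static void Main()
    {
        byte[] bytes = EncodeIsciiTamil("\u0BB9"); // TAMIL LETTER HA

        // Print each byte as a decimal value for comparison with the
        // value reported in the answer.
        foreach (byte b in bytes)
        {
            Console.WriteLine(b);
        }
    }
}
```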
Update
If you wish to take a Unicode file as input and transliterate its characters to get a single-byte representation, the following should do the trick. The resulting array should have your single-byte representation if your dictionary encodes each character:
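The answer's own code is not reproduced here, so what follows is only a minimal sketch of the approach it describes, not the original listing. It assumes the input file is UTF-8, that ASCII characters pass through unchanged, and that unmapped characters become `?`. The `Transliterator` class, its methods, and the mapping table are hypothetical; only the single entry mapping "ஹ" to 0x86 comes from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class Transliterator
{
    // Hypothetical mapping table; only this entry is taken from the question.
    private static readonly Dictionary<char, byte> Map = new Dictionary<char, byte>
    {
        { '\u0BB9', 0x86 }, // TAMIL LETTER HA -> 0x86
    };

    // Produces one output byte per UTF-16 code unit of the input.
    public static byte[] Transliterate(string text)
    {
        var result = new byte[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (Map.TryGetValue(c, out byte mapped))
            {
                result[i] = mapped;
            }
            else if (c < 0x80)
            {
                result[i] = (byte)c;   // assumption: ASCII passes through unchanged
            }
            else
            {
                result[i] = (byte)'?'; // assumption: unmapped characters become '?'
            }
        }
        return result;
    }

    // Reads a Unicode (assumed UTF-8) file and writes the raw mapped bytes.
    public static void ConvertFile(string inputPath, string outputPath)
    {
        string text = File.ReadAllText(inputPath, Encoding.UTF8);
        File.WriteAllBytes(outputPath, Transliterate(text));
    }
}
```

Writing the result with `File.WriteAllBytes` avoids passing the mapped values back through a text encoder, which is where a byte such as 0x86 could otherwise be turned into `?`.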