Arabic Presentation Forms B support in C#
I was trying to convert a file from UTF-8 to the Arabic-1265 encoding using the Encoding APIs in C#, but I ran into a strange problem: some characters are not converted correctly. For example, "لا" in the statement "ﻣﺣﻣد ﺻﻼ ح عادل" comes out as "ﻣﺣﻣد ﺻ? ح عادل". Some of my friends told me that this is because these characters come from Arabic Presentation Forms B. I created the file in Notepad++ and saved it as UTF-8.
Here is the code I use:
StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
string str = sr.ReadLine();
StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
sw.Write(str);
sw.Flush();
sw.Close();
However, I don't know how to correctly convert a file that contains these presentation forms in C#.
Comments (3)
Yes, your string contains lots of ligatures that cannot be represented in the 1256 code page. You'll have to decompose the string before writing it. Like this:
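The snippet that originally followed this answer is missing from this copy. What follows is only a minimal sketch of the decomposition step, assuming that compatibility normalization (`NormalizationForm.FormKC`) is what was meant. The class and method names are made up for illustration, and the `RegisterProvider` call is an assumption that applies to .NET Core / .NET 5+ only.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8ToWindows1256
{
    // Hypothetical helper: compatibility normalization maps Presentation
    // Forms characters such as U+FEFB to their ordinary Arabic-block
    // equivalents (U+0644 U+0627). Ordinary Arabic letters are left alone.
    public static string Decompose(string text)
    {
        return text.Normalize(NormalizationForm.FormKC);
    }

    // Reads the whole UTF-8 file, decomposes it, and writes it as windows-1256.
    public static void Convert(string sourcePath, string targetPath)
    {
        // Assumption: on .NET Core / .NET 5+ code-page encodings must be
        // registered first. On .NET Framework this line is not needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding target = Encoding.GetEncoding("windows-1256");

        string text = File.ReadAllText(sourcePath, Encoding.UTF8);
        File.WriteAllText(targetPath, Decompose(text), target);
    }

    public static void Main()
    {
        // Show the code points that the lam-alef ligature decomposes into.
        foreach (char c in Decompose("\uFEFB"))
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

With the paths from the question, the call would be `Utf8ToWindows1256.Convert(@"C:\utf-8.txt", @"C:\windows-1256.txt")`. Unlike the question's `ReadLine` call, this reads the entire file rather than only its first line. Note that `FormKC` also rewrites other compatibility characters (full-width forms, Latin ligatures and so on), which may or may not be wanted.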
To give a more general answer:
The Windows-1256 encoding is an obsolete 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.
Unicode has a much wider range of characters. In particular, it contains:
- the “normal” Arabic characters, U+0600 to U+06FF. These are supposed to be used for normal Arabic text, including text written in other languages that use the Arabic script, such as Farsi. For example, “لا” is U+0644 (ل) followed by U+0627 (ا).
- the “Presentation Form” characters, U+FB50 to U+FDFF (“Presentation Forms-A”) and U+FE70 to U+FEFF (“Presentation Forms-B”). These are not intended to be used for representing Arabic text. They are primarily intended for compatibility, especially with font-file formats that require separate code points for every different ligated form of every character and ligated character combination. The “لا” ligature is represented by a single code point (U+FEFB) despite being two characters.
When encoding into Windows-1256, the .NET encoding for Windows-1256 will automatically convert characters from the Presentation Forms block to “normal text” because it has no other choice (except of course to turn it all into question marks). For obvious reasons, it can only do that with characters that actually have an “equivalent”.
When decoding from Windows-1256, the .NET encoding for Windows-1256 will always generate characters from the “normal text” block.
As we’ve discovered, your input file contains characters that are not representable in Windows-1256. Such characters will turn into question marks (?). Furthermore, those Presentation-Form characters which do have a normal-text equivalent will change their ligation behaviour, because that is what normal Arabic text does.
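To check whether a given input actually contains Presentation Forms characters, the two ranges quoted in this answer can be tested directly. This is only a sketch; the class and method names are hypothetical, and the sample string in `Main` is made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PresentationFormsScanner
{
    // Ranges as quoted in the answer above. Note that the second range,
    // as quoted, also includes U+FEFF (the byte order mark).
    public static bool IsPresentationForm(char c)
    {
        return (c >= '\uFB50' && c <= '\uFDFF')   // Presentation Forms-A
            || (c >= '\uFE70' && c <= '\uFEFF');  // Presentation Forms-B
    }

    // Returns one description per Presentation Forms character found.
    public static List<string> FindPresentationForms(string text)
    {
        var found = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            if (IsPresentationForm(text[i]))
            {
                found.Add(string.Format("index {0}: U+{1:X4}", i, (int)text[i]));
            }
        }
        return found;
    }

    public static void Main()
    {
        // U+0635 is an ordinary Arabic letter; U+FEFC is a lam-alef
        // ligature from Presentation Forms-B.
        foreach (string hit in FindPresentationForms("\u0635\uFEFC"))
        {
            Console.WriteLine(hit);
        }
    }
}
```

Running such a scan over the text read from the input file would show which characters are candidates for turning into question marks or for being mapped to normal-text equivalents.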
First of all, the two characters you quoted are not from the Arabic Presentation Forms block. They are \x0644 and \x0627, which are from the standard Arabic block. However, just to be sure I tried the character \xFEFB, which is the “equivalent” (not equivalent, but you know) character for لا from the Presentation Forms block, and it works fine even for that.
Secondly, I will assume you mean the encoding Windows-1256, which is for legacy 8-bit Arabic text.
So I tried the following:
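The code from this answer did not survive in this copy. The sketch below shows what such a test might look like, covering both the encoding step and the reverse step. The class and method names are hypothetical, and the `RegisterProvider` call is an assumption that applies to .NET Core / .NET 5+ only.

```csharp
using System;
using System.Text;

public static class Windows1256RoundTrip
{
    public static Encoding GetWindows1256()
    {
        // Assumption: required on .NET Core / .NET 5+; not needed on
        // .NET Framework, where code-page encodings are built in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("windows-1256");
    }

    public static void Main()
    {
        Encoding enc = GetWindows1256();

        // Encode the two ordinary Arabic letters U+0644 and U+0627.
        byte[] bytes = enc.GetBytes("\u0644\u0627");
        Console.WriteLine(string.Join(", ", bytes));

        // Decode the same two bytes and print the resulting code points,
        // which avoids depending on how the console renders Arabic.
        string decoded = enc.GetString(new byte[] { 225, 199 });
        foreach (char c in decoded)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

Printing code points rather than the decoded string sidesteps the console rendering issue the answer mentions.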
The output I get is 225, 199. So let’s try to turn it back:
Fair enough, the Console does not display the result correctly, but the Watch window in the debugger tells me that the answer is correct (it says “لا”). I can also copy the output from the Console and it is correct in the clipboard.
Therefore, the Windows-1256 encoding is working just fine and it is not clear what your problem is.
My recommendation:
Write a short piece of code that exhibits the problem.
Post a new question with that piece of code.
In that question, describe exactly what result you get, and what result you expected instead.