How can a text file have more than one encoding?
I have a file which is ANSI encoded. However, it shows Arabic letters inside it. This text file was generated by some program (which I have no information about), but it seems there is some kind of internal encoding (if I may put it that way, and if that is even possible) that makes the Arabic letters appear.
Is there such a thing? If not, how can an ANSI file show Arabic letters?
*If possible, please explain with Java code.
Edit 01
When I open it in Notepad++ it shows that the page encoding is ANSI. Please check this photo:
http://www.4shared.com/file/221862075/e8705951/text-Windows.html
Edit 02
You can check the file here:
Comments (6)
Short answer: Most likely, your text file is not "ANSI"-encoded but UTF-8.
Long answer:
First, the term "ANSI" (on Windows) doesn't mean a fixed encoding; its meaning depends on your language settings. For example, in Western Europe and the USA, it will usually be Windows-1252 (a variant of ISO/IEC 8859-1, also known as latin-1), in Japan it's Shift JIS, and in Arabic countries it's Windows-1256 (which serves the same purpose as ISO/IEC 8859-6 but is a different encoding).
If you are using a non-Arabic version of Windows and have not changed your language settings, and you can see Arabic letters in the file when you open it in Notepad, then it is certainly not in your system's ANSI encoding. Instead, it is probably Unicode.
Note that I don't mean "UNICODE", which on Windows usually means UTF-16LE. It could be UTF-8 as well. Both are encodings that can encode all 100,000+ characters currently defined in Unicode, but they do it in different ways. Both are variable-length encodings, meaning that not all characters are encoded using the same number of bits.
In UTF-8, each character is encoded as one to four bytes. The encoding was designed so that ASCII characters are encoded in one byte.
In UTF-16, each character is encoded as either two or four bytes. This encoding was originally designed when Unicode had fewer than 64K characters, so every character could be encoded in a single 16-bit word. Later, when it became clear that Unicode would have to grow beyond the 64K limit, a scheme was invented in which pairs of words in the range 0xD800-0xDFFF are used to represent characters outside of the first 64K (minus 0x800) characters.
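Since the question asked for Java, here is a minimal sketch of the variable-length behaviour described above. The class name EncodedLengths and the sample characters are my own choices for illustration; the charsets come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengths {
    // Number of bytes needed to encode the given text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed in UTF-16 (little-endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, plain ASCII
            "\u0627",                               // U+0627 ARABIC LETTER ALEF
            "\u20AC",                               // U+20AC EURO SIGN
            new String(Character.toChars(0x1F600))  // U+1F600, outside the first 64K
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, and 2, 2, 2 and 4 bytes in UTF-16; the last one is stored as a surrogate pair.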
To see what's actually in the file, open it in a hex editor:
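If no hex editor is at hand, a few lines of Java can print the first bytes of the file instead. This is only a sketch: the class name HexPeek, the 64-byte limit and the default file name data.txt are placeholders I picked, not anything taken from the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HexPeek {
    // Formats up to `limit` bytes as two-digit uppercase hex values separated by spaces.
    static String toHex(byte[] data, int limit) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(limit, data.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "data.txt" is a placeholder; pass the real path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "data.txt");
        byte[] bytes = Files.readAllBytes(path);
        System.out.println(toHex(bytes, 64));
    }
}
```

As a rough guide when reading the output: a file starting with EF BB BF carries a UTF-8 byte order mark, and one starting with FF FE is typically UTF-16LE. Arabic letters in UTF-8 appear as two-byte sequences whose first byte is in the range D8 to DB, whereas in a single-byte Arabic code page each letter is one byte with the high bit set.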
How do you know that it's ANSI encoded? If it's not a multi-byte encoding like UTF-8, my guess would be that it's encoded using an Arabic code page such as Windows-1256.
You could look at the file in a hex editor, find out what byte values the Arabic characters have, and that way try to work out which encoding / code page it was created with.
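The same trial-and-error can be sketched in Java by decoding the bytes strictly with a few candidate charsets and counting how many Arabic letters come out. The class name CharsetProbe, the candidate list and the six sample bytes are assumptions for illustration (the bytes stand in for the real file content). Charset.forName("windows-1256") relies on the extended charsets that ship with a standard JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetProbe {
    // Decodes strictly: returns null if the bytes are not valid in the given charset.
    static String tryDecode(byte[] data, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Counts code points in the Unicode Arabic block (U+0600..U+06FF).
    static long countArabic(String text) {
        return text.codePoints().filter(cp -> cp >= 0x0600 && cp <= 0x06FF).count();
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the file's bytes.
        byte[] sample = {(byte) 0xC7, (byte) 0xE1, (byte) 0xD3,
                         (byte) 0xE1, (byte) 0xC7, (byte) 0xE3};
        Charset[] candidates = {
            StandardCharsets.UTF_8,
            Charset.forName("windows-1256"),
            Charset.forName("windows-1252")
        };
        for (Charset cs : candidates) {
            String decoded = tryDecode(sample, cs);
            if (decoded == null) {
                System.out.println(cs.name() + ": not valid in this encoding");
            } else {
                System.out.println(cs.name() + ": " + countArabic(decoded)
                        + " Arabic letters out of "
                        + decoded.codePointCount(0, decoded.length()) + " characters");
            }
        }
    }
}
```

For these sample bytes, UTF-8 is rejected as malformed, windows-1256 yields six Arabic letters, and windows-1252 decodes without error but produces only accented Latin letters. A successful decode alone therefore proves little for single-byte code pages; the decoded text still has to make sense.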
No.
It’s not a Windows-ANSI encoded file. More likely, it uses a variable-width encoding, most likely UTF-8: many common character positions in UTF-8 are equivalent to their positions in US-ASCII (in fact, it was designed that way), and by inference also for Windows-ANSI.
EDIT: We have to thank Microsoft for this confusion. “ANSI” isn’t well-specified when it comes to encodings. Usually it’s meant to stand for the Windows default encoding with codepage 1252 (“Windows-1252”), which happens to correspond to “Western” alphabets derived from Latin.
However, in other countries the default encoding used by Windows (in older Windows versions … today, the default is UTF-8) is not Windows-1252 but rather a different encoding, which is then also called “ANSI”. In this case, codepage 1256.
I tried opening the file in both Firefox and Opera. I had to set the character encoding to Arabic Windows-1256 to get it to display correctly in both browsers, so the file's encoding is most likely to be that.
NOTE:
I originally posted this as a comment, but was asked to make it an answer.
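If the file really is Windows-1256, as the browser test suggests, reading it in Java and re-saving it as UTF-8 could look roughly like this. The class name Cp1256ToUtf8 and both file names are placeholders, and the source encoding is an assumption based on that test, not something confirmed by the program that produced the file.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class Cp1256ToUtf8 {
    // Assumed source encoding, based on the browser test.
    static final Charset SOURCE = Charset.forName("windows-1256");

    // Re-encodes bytes from windows-1256 to UTF-8.
    static byte[] convert(byte[] cp1256Bytes) {
        String text = new String(cp1256Bytes, SOURCE);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Both file names are placeholders; pass real paths as arguments.
        Path in = Paths.get(args.length > 0 ? args[0] : "data.txt");
        Path out = Paths.get(args.length > 1 ? args[1] : "data-utf8.txt");
        Files.write(out, convert(Files.readAllBytes(in)));
    }
}
```

The key point is to name the charset explicitly when turning bytes into a String, rather than relying on the platform default, which is exactly what makes "ANSI" files look different from machine to machine.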
ANSI character encoding allows for 217 characters and does not contain Arabic letters. I think perhaps the file uses an alternative encoding.
Answering your edit: it appears that the problem is with Notepad++, because what is being displayed is clearly beyond the capabilities of the ANSI charset.
First I downloaded your file and tried to use vim to check its encoding. It didn't seem to know, and on a second machine it said
latin1
which could be similar to what happened in Notepad++ (it gave the generic answer). So I ran
file data.txt
and the output was this:
Hope this helps.
EDIT:
Trying the file in a browser showed that this answer is incorrect.
ISO-8859-4 and ISO-8859-13 could display the text without errors, but the characters were not Arabic.