How can a text file have multiple encodings?

Posted on 2024-08-22 01:19:11

I have a file which is ANSI encoded. However, it shows Arabic letters inside it. This text file was generated by some program (which I have no information about), but it seems like there is some kind of internal encoding (if I may put it that way, and if that is even possible) that makes the Arabic letters appear.

Is there such a thing? If not, how can the ANSI file show the Arabic letters?

*If possible, please explain in Java code.


Edition 01

When I open it in Notepad++ it shows that the page encoding is ANSI. Please check this photo:

http://www.4shared.com/file/221862075/e8705951/text-Windows.html


Edition 02

You can check the file at:

http://www.4shared.com/file/221853641/3fa1af8c/data.html

Comments (6)

杯别 2024-08-29 01:19:11

Short answer: Your text file is likely not "ANSI"-encoded, but UTF-8.

Long answer:

First, the term "ANSI" (on Windows) doesn't mean a fixed encoding; its meaning depends on your language settings. For example, in Western Europe and the USA it will usually be Windows-1252 (a variant of ISO/IEC 8859-1, also known as Latin-1), in Japan it's Shift JIS, and in Arabic countries it's Windows-1256 (Microsoft's Arabic code page, comparable in purpose to ISO/IEC 8859-6 but not byte-for-byte identical to it).

If you are using a non-Arabic version of Windows and have not changed your language settings, and you can see Arabic letters in the file when you open it in Notepad, then it is certainly not in any of these ANSI encodings. Instead, it is probably Unicode.

Note that I don't mean "UNICODE", which on Windows usually means UTF-16LE. It could be UTF-8 as well. Both are encodings that can encode all 100,000+ characters currently defined in Unicode, but they do it in different ways. Both are variable-length encodings, meaning that not all characters are encoded using the same number of bytes.

In UTF-8, each character is encoded as one to four bytes. The encoding has been chosen such that ASCII characters are encoded in one byte.

In UTF-16, each character is encoded as either two or four bytes. This encoding was originally invented when Unicode had fewer than 64K characters, so every character could be encoded in a single 16-bit word. Later, when it became clear that Unicode would have to grow beyond the 64K limit, a scheme was invented where pairs of words in the range 0xD800-0xDFFF (surrogate pairs) are used to represent characters outside of the first 64K (minus 0x800) characters.
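
Since the question asks for Java, here is a minimal sketch of the variable-length behaviour described above. The class and method names are made up for illustration; only the standard `String.getBytes(Charset)` call is relied on.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengthDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (little endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        // Latin "A", Arabic sheen (U+0634), and an emoji (U+1F600) outside the first 64K.
        String[] samples = { "A", "\u0634", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(s + ": UTF-8 = " + utf8Length(s)
                    + " bytes, UTF-16 = " + utf16Length(s) + " bytes");
        }
    }
}
```

The ASCII letter takes one byte in UTF-8 and two in UTF-16, the Arabic letter takes two bytes in both, and the emoji takes four bytes in both (a surrogate pair in UTF-16).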

To see what's actually in the file, open it in a hex editor:

  • If the first two bytes are FF FE, then it is likely UTF-16LE (little endian)
  • If the first two bytes are FE FF, then it is likely UTF-16BE (big endian, unlikely on Windows)
  • If the first three bytes are EF BB BF, then it is likely UTF-8
  • If you see a lot of 00 bytes, it is likely UTF-16 (or UTF-32, if you see pairs of 00 bytes)
  • If Arabic characters occupy a single byte, it is likely a single-byte Arabic code page such as ISO-8859-6 or Windows-1256 (e.g. ش would be D4 in both).
  • If Arabic characters occupy multiple bytes, it is likely UTF-8 (e.g. ش would be D8 B4).
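
The checks in the list above can be sketched in Java roughly as follows. This is only a heuristic under stated assumptions: `EncodingSniffer` and its methods are hypothetical names, a byte order mark is optional in practice, and a file that happens to pass the UTF-8 validity check is not guaranteed to be UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingSniffer {
    // Guess the encoding from a leading byte order mark, if there is one.
    static String guessFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE"; // likely, not certain
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return "unknown";
    }

    // True if the bytes form well-formed UTF-8. Single-byte Arabic text
    // (ISO-8859-6, Windows-1256) almost never passes this check.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Compare the first bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

For example, the two bytes D8 B4 pass `isValidUtf8`, while a lone D4 byte does not, which matches the single-byte versus multi-byte distinction in the last two bullets.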
洋洋洒洒 2024-08-29 01:19:11

How do you know that it's ANSI encoded? If it's not a multi-byte encoding like UTF-8, my guess would be that it's encoded using an Arabic code page such as Windows-1256.

You could look at the file in a hex editor, find out which byte values the Arabic characters have, and from that try to work out which encoding / code page it was created with.
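
If no hex editor is at hand, a few lines of Java can print the byte values instead. This is a small sketch; the `HexDump` class is a made-up helper, and the file path is whatever you pass on the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class HexDump {
    // Format bytes as space-separated, two-digit uppercase hex values.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HexDump <file>");
            return;
        }
        // Reads the whole file into memory, which is fine for small text files.
        System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Byte values of 80 and above are the interesting ones: they show whether each Arabic letter is stored as one byte or as several.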

╰◇生如夏花灿烂 2024-08-29 01:19:11

Is there such a thing?

No.

If not, how can the ANSI file show the Arabic letters?

It’s not a Windows-ANSI encoded file. More likely, it uses a variable-width encoding, most likely UTF-8: the ASCII characters are encoded in UTF-8 exactly as in US-ASCII (in fact, it was designed that way), and therefore also exactly as in the Windows-ANSI code pages, so the two are easy to confuse.

EDIT: We have to thank Microsoft for this confusion. “ANSI” isn’t well-specified when it comes to encodings. Usually it’s meant to stand for the Windows default encoding with codepage 1252 (“Windows-1252”), which happens to correspond to “Western” alphabets derived from Latin.

However, in other countries the default legacy encoding used by Windows is not Windows-1252 but rather a different encoding, which is then also called “ANSI” (modern Windows works with Unicode internally, but the legacy “ANSI” code page still depends on the system locale). In this case, it is codepage 1256.

放低过去 2024-08-29 01:19:11

I tried opening the file in both Firefox and Opera. I had to set the character encoding to Arabic Windows-1256 to get it to display correctly in both browsers, so the file's encoding is most likely to be that.

NOTE:
I originally posted this as a comment, but was asked to make it an answer.
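
The same experiment can be done in Java by decoding the raw bytes with an explicitly named charset. This is a sketch under assumptions: `LegacyDecodeDemo` is a hypothetical name, and `windows-1256` is not one of the charsets every Java runtime is required to support, although common full JDK distributions ship it.

```java
import java.nio.charset.Charset;

final class LegacyDecodeDemo {
    // Decode raw bytes using the named charset.
    // Charset.forName throws UnsupportedCharsetException if the runtime lacks it.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xD4 };
        // The same byte is a different character depending on the code page.
        System.out.println(decode(raw, "windows-1256")); // Arabic letter sheen (U+0634)
        System.out.println(decode(raw, "windows-1252")); // Latin capital O with circumflex (U+00D4)
    }
}
```

To read the whole file this way, pass the bytes from `Files.readAllBytes` to `decode`, or open a reader with the same charset.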

樱花细雨 2024-08-29 01:19:11

ANSI character encoding allows for 217 characters and does not contain Arabic letters. I think perhaps the file uses an alternative encoding.

Answering your edit: it appears that the problem is with Notepad++, because what is being displayed is clearly beyond the capabilities of the ANSI charset.

迷雾森÷林ヴ 2024-08-29 01:19:11

First I downloaded your file and tried to use vim to check its encoding. It didn't seem to know, and on a second machine it said latin1, which could be similar to what happened in Notepad++ (it gave a generic answer).
So I ran file data.txt and the output was this:

data.txt: ISO-8859 text, with CRLF line terminators

Hope this helps.

EDIT:
Trying the browser approach showed that this answer is incorrect.

ISO-8859-4 and ISO-8859-13 could display the text without errors, but the characters were not Arabic.
