How to identify whether file content is ASCII or binary
How do you identify the file content as being in ASCII or binary using C++?
If a file contains only the decimal bytes 9–13, 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark:

- If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
- If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00, the file is tentatively UTF-16 LE.
- If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
- If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.

If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
You iterate through it using a normal loop with stream.get(), and check whether the byte values you read are <= 127 (one way among many to do it).

However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ascii". If you mean the character code, then this is indeed the way to go. But if you mean only alphanumeric values, you'll need another approach.
My text editor decides based on the presence of null bytes. In practice, that works really well: binary files with no null bytes are extremely rare.
The contents of every file are binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you'll see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.
Have a look at how the file command works; it has three strategies to determine the type of a file: filesystem tests, magic-number tests, and language tests.

Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However, if san was after knowing how to determine whether the file contains text or not, then the issue becomes far more complex. ASCII is just one, increasingly unpopular, way of representing text. The Unicode encodings UTF-16, UTF-32 and UTF-8 have grown in popularity. In theory, they can easily be tested for by checking whether the first two bytes are the Unicode byte-order mark (BOM) 0xFEFF (or 0xFFFE if the byte order is reversed). However, as those two bytes screw up many file formats on Linux systems, they cannot be guaranteed to be there. Further, a binary file might happen to start with 0xFEFF.

Looking for 0x00s (or other control characters) won't help either if the file is Unicode. If the file is UTF-16, say, and it contains English text, then every other byte will be 0x00.
If you know the language that the text file will be written in, then it is possible to analyse the bytes and statistically determine whether it contains text. For example, the most common letter in English is E, followed by T. So if the file contains many more E's and T's than Z's and X's, it's likely text. Of course, it would be necessary to run this test against ASCII and the various Unicode encodings to be sure.
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.
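A toy version of that frequency test for English might look like the following; the choice of letters and the 2x threshold are arbitrary picks of mine, not calibrated values, and this only works on single-byte encodings:

```cpp
#include <cctype>
#include <string>

// Count common English letters (e, t) against rare ones (z, x); English
// text should show far more of the former. A real detector would use a
// full letter-frequency table and a proper statistical test.
bool looksLikeEnglishText(const std::string& data) {
    long common = 0, rare = 0;
    for (unsigned char b : data) {
        int c = std::tolower(b);  // unsigned char keeps tolower() safe
        if (c == 'e' || c == 't') ++common;
        else if (c == 'z' || c == 'x') ++rare;
    }
    return common > 0 && common > 2 * rare;
}
```

For UTF-16 input you would first strip the interleaved zero bytes (or decode properly) before counting, as the answer notes.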
Well, this depends on your definition of ASCII. You can either check for values with ASCII code <128 or for some charset you define (e.g. 'a'-'z','A'-'Z','0'-'9'...) and treat the file as binary if it contains some other characters.
You could also check for regular line breaks (0x0A, or 0x0D followed by 0x0A) to detect text files.
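Checking against a caller-defined charset, as described above, might be sketched like this (the function name and the allowed set in the usage below are just examples):

```cpp
#include <string>

// Accept only bytes from a caller-supplied whitelist, plus LF and CR
// for line breaks; anything else makes us treat the data as binary.
bool withinCharset(const std::string& data, const std::string& allowed) {
    for (char c : data) {
        if (c == '\n' || c == '\r') continue;      // regular line breaks
        if (allowed.find(c) == std::string::npos)  // not in the whitelist
            return false;
    }
    return true;
}
```

For instance, passing "abcdefghijklmnopqrstuvwxyz " as the allowed set accepts lower-case English text with spaces and rejects everything else.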
To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary.
After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters.
You can't (portably) use 'A' and 'Z'; there's no guarantee those are in ASCII (!).
If you need them, you'll have to define them yourself.
This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to a The Old New Thing article on how Notepad detects the type of an ASCII file. It's not perfect, but it's interesting to see how Microsoft handles it.
GitHub's linguist uses the charlock_holmes library to detect binary files, which in turn uses ICU's charset detection.
The ICU library is available for many programming languages, including C and Java.