How to identify whether file content is ASCII or binary
How do you identify the file content as being in ASCII or binary using C++?
If a file contains only the decimal bytes 9–13, 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark:

- If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
- If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00, the file is tentatively UTF-16 LE.
- If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
- If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.

If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
You iterate through it using a normal loop with stream.get(), and check whether the byte values you read are <= 127 (one way among many to do it).

However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ascii". If you mean the character code, then this is indeed the way to go. But if you mean only alphanumeric values, you'll need another approach.
My text editor decides based on the presence of null bytes. In practice, that works really well: binary files with no null bytes are extremely rare.
The contents of every file are binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you'll see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.
Have a look at how the file command works; it has three strategies to determine the type of a file: filesystem tests, magic-number tests, and language tests.

Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However, if san was after knowing how to determine whether the file contains text or not, then the issue becomes far more complex. ASCII is just one, increasingly unpopular, way of representing text. The Unicode encodings UTF-16, UTF-32 and UTF-8 have grown in popularity. In theory, they can easily be tested for by checking whether the first two bytes are the Unicode byte-order mark (BOM) 0xFEFF (or 0xFFFE if the byte order is reversed). However, as those two bytes screw up many file formats on Linux systems, they cannot be guaranteed to be there. Further, a binary file might happen to start with 0xFEFF.

Looking for 0x00s (or other control characters) won't help either if the file is Unicode. If the file is UTF-16, say, and it contains English text, then every other byte will be 0x00.
If you know the language that the text file will be written in, then it is possible to analyse the bytes and statistically determine whether it contains text. For example, the most common letter in English is E, followed by T. So if the file contains many more E's and T's than Z's and X's, it's likely text. Of course, it would be necessary to run this test against ASCII and the various Unicode encodings to be sure.
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.
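A toy version of that frequency test for English might look like the following; the choice of letters and the 2x threshold are arbitrary picks of mine, not calibrated values, and this only works on single-byte encodings:

```cpp
#include <cctype>
#include <string>

// Count common English letters (e, t) against rare ones (z, x); English
// text should show far more of the former. A real detector would use a
// full letter-frequency table and a proper statistical test.
bool looksLikeEnglishText(const std::string& data) {
    long common = 0, rare = 0;
    for (unsigned char b : data) {
        int c = std::tolower(b);  // unsigned char keeps tolower() safe
        if (c == 'e' || c == 't') ++common;
        else if (c == 'z' || c == 'x') ++rare;
    }
    return common > 0 && common > 2 * rare;
}
```

For UTF-16 input you would first strip the interleaved zero bytes (or decode properly) before counting, as the answer notes.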
Well, this depends on your definition of ASCII. You can either check for values with ASCII code <128 or for some charset you define (e.g. 'a'-'z','A'-'Z','0'-'9'...) and treat the file as binary if it contains some other characters.
You could also check for regular line breaks (0x0A, or 0x0D followed by 0x0A) to detect text files.
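Checking against a caller-defined charset, as described above, might be sketched like this (the function name and the allowed set in the usage below are just examples):

```cpp
#include <string>

// Accept only bytes from a caller-supplied whitelist, plus LF and CR
// for line breaks; anything else makes us treat the data as binary.
bool withinCharset(const std::string& data, const std::string& allowed) {
    for (char c : data) {
        if (c == '\n' || c == '\r') continue;      // regular line breaks
        if (allowed.find(c) == std::string::npos)  // not in the whitelist
            return false;
    }
    return true;
}
```

For instance, passing "abcdefghijklmnopqrstuvwxyz " as the allowed set accepts lower-case English text with spaces and rejects everything else.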
To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary.
After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters.
You can't (portably) use 'A' and 'Z'; there's no guarantee those are in ASCII (!).
If you need them, you'll have to define them yourself.
This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to a The Old New Thing article on how Notepad detects the type of an ASCII file. It's not perfect, but it's interesting to see how Microsoft handles it.
GitHub's linguist uses the charlock_holmes library to detect binary files, which in turn uses ICU's charset detection.
The ICU library is available for many programming languages, including C and Java.