Detecting whether a file is binary or plain text?
How can I detect whether a file is binary or plain text?
Basically, my .NET app processes batch files and extracts data from them, but I don't want to process binary files.
As a solution, I'm thinking about analysing the first X bytes of the file: if there are more unprintable characters than printable ones, the file should be treated as binary.
Is this the right way to do it? Is there a better implementation for this task?
You could run a regex over the first X bytes and treat the file as text only if every byte falls within a suitable character class. But that presupposes that you know the encoding.
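A minimal sketch of the regex idea, written in Python for brevity (the same pattern should port to a .NET Regex). The character class, the function names and the 512-byte sample size are assumptions chosen for illustration, and the class only makes sense for ASCII-compatible single-byte text:

```python
import re

# Assumed character class: tab, LF, CR and printable ASCII (0x20-0x7E).
# Any byte outside this class makes the match fail.
TEXT_BYTES = re.compile(rb"[\t\n\r\x20-\x7e]*")


def sample_is_text(sample: bytes) -> bool:
    """Return True if every byte of the sample is in the assumed class."""
    return TEXT_BYTES.fullmatch(sample) is not None


def file_looks_like_text(path: str, sample_size: int = 512) -> bool:
    """Apply the check to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return sample_is_text(f.read(sample_size))
```

Note that this rejects UTF-8 text containing any non-ASCII character, which is exactly the encoding caveat mentioned above.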
I think the best way of doing this is to take at most the first X bytes of the file (X could be 256, 512, etc.) and count the bytes that a plain ASCII text file would not use (the permitted ASCII codes are 10, 13, and 32-126). If you know for sure that the script is written in English, then no character can fall outside that set. If you are not sure about the language, then you may permit at most Y characters outside the set (if X is 512, I would choose Y to be 8 or 10).
If this is not good enough, you can add more constraints. For example, depending on the syntax of the files, certain keywords should be present (e.g. your batch files should contain some of echo, for, if, goto, call, exit, etc.).
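The counting heuristic above could be sketched roughly as follows, in Python for brevity. The function names are hypothetical; the allowed set (10, 13, 32-126), the Y threshold and the keyword list come from the answer. Note that the set as given does not include tab (code 9), so tabs count against the limit here:

```python
import re

# Byte values the answer permits: LF (10), CR (13) and printable ASCII 32-126.
ALLOWED = frozenset({10, 13}) | frozenset(range(32, 127))

# Keywords the answer suggests looking for in batch files (whole words only).
BATCH_KEYWORD = re.compile(rb"\b(?:echo|for|if|goto|call|exit)\b", re.IGNORECASE)


def count_disallowed(sample: bytes) -> int:
    """Count the bytes in the sample that fall outside the allowed set."""
    return sum(1 for b in sample if b not in ALLOWED)


def is_probably_text(sample: bytes, max_disallowed: int = 0) -> bool:
    """max_disallowed plays the role of Y; 0 means strict ASCII only."""
    return count_disallowed(sample) <= max_disallowed


def mentions_batch_keyword(sample: bytes) -> bool:
    """Optional extra constraint: at least one batch keyword is present."""
    return BATCH_KEYWORD.search(sample) is not None
```

A caller would read the first X bytes in binary mode and pass them in, for example is_probably_text(head, max_disallowed=8) for X = 512.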
The Unix file command does this in a clever way. Of course, it does a lot more, but you can check the algorithm here and then build something specialized.
UPDATE: The link above seems to be broken. Try this.
What exactly do you mean by binary? Is the 'Art of War' written in Chinese binary to you? What about a Japanese-English dictionary?
There is no truly 100% reliable way.
You would need to use some kind of heuristic.
Some options might be to look at things such as file extensions and file signatures.
If the above (especially file signatures and extensions) don't help, then try to guess based on the presence or absence of certain bytes (like you are doing).
Note: It is better to check extensions/signatures first, as you would only need to read a few bytes or the file metadata, which is much more efficient than actually reading the whole file.
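The ordering described above (cheap checks first, byte-level guessing last) might look like the sketch below, in Python for brevity. The extension list, the function names and the NUL-byte fallback are assumptions chosen for illustration; the signatures shown are the well-known leading bytes of those formats, but the table is far from complete:

```python
import os

# A few well-known leading signatures ("magic numbers").
SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"%PDF-",              # PDF
    b"PK\x03\x04",         # ZIP and ZIP-based containers
    b"\x7fELF",            # ELF executables
    b"GIF87a", b"GIF89a",  # GIF
    b"\xff\xd8\xff",       # JPEG
    b"MZ",                 # DOS/Windows executables (short, so it can misfire)
)

# Hypothetical extension blacklist; adjust to the files you actually see.
BINARY_EXTENSIONS = {".exe", ".dll", ".zip", ".png", ".jpg", ".gif", ".pdf"}


def guess_kind(filename: str, head: bytes) -> str:
    """Return "binary" or "text" from the file name and its first bytes."""
    if os.path.splitext(filename)[1].lower() in BINARY_EXTENSIONS:
        return "binary"              # 1. extension: no file content needed
    if head.startswith(SIGNATURES):
        return "binary"              # 2. signature: only the first bytes
    if b"\x00" in head:
        return "binary"              # 3. assumed fallback: NUL bytes are rare in text
    return "text"


def guess_file_kind(path: str, sample_size: int = 512) -> str:
    """Read only the first sample_size bytes and classify the file."""
    with open(path, "rb") as f:
        return guess_kind(path, f.read(sample_size))
```

The NUL-byte fallback would misclassify UTF-16 text, so it is only a stand-in for whichever byte heuristic suits your files.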