How can I check whether a file is text-based?
I am working on a small text replacement application that lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs on files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Comments (6)
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes, there's a non-zero probability you'll guess wrong, given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc.). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
Windows used to provide an API, IsTextUnicode, that would do a probabilistic examination, but there were well-known false positives.
My take is that trying to be smarter than the user has some issues...
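To illustrate why byte inspection alone can guess wrong, here is a minimal Python sketch. The function name plausible_encodings and the list of candidate encodings are assumptions made for this example, not part of any Windows API. It tries to decode the same bytes under several encodings and reports every one that succeeds.

```python
def plausible_encodings(data: bytes,
                        candidates=("ascii", "utf-8", "cp1252", "utf-16-le")):
    """Return the candidate encodings under which `data` decodes without error.

    The candidate list is an illustrative assumption; a real tool would
    pick encodings relevant to its users.
    """
    matches = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


# Six bytes of UTF-8 text that are also valid in other encodings.
sample = "café!".encode("utf-8")
print(plausible_encodings(sample))
```

More than one encoding accepts these bytes, so a successful decode does not by itself prove which encoding (if any) the author intended.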
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, and instead ask the user for the go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long, assuming you're not performing a Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine whether a file is text-based or binary. You would have to examine each byte in the file to determine whether it is a valid character, irrespective of the file encoding.
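As a rough illustration of the "examine every byte" check, the sketch below streams a file through a strict incremental decoder. The helper names read_chunks and decodes_cleanly, the 64 KB chunk size, and the default of UTF-8 are all assumptions for this example; the cost is still one full pass over the file.

```python
import codecs


def read_chunks(path, chunk_size=64 * 1024):
    """Yield the file's bytes in fixed-size chunks so it is never loaded whole."""
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def decodes_cleanly(chunks, encoding="utf-8"):
    """Return True if every byte in `chunks` is valid in `encoding`."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        for chunk in chunks:
            decoder.decode(chunk)  # keeps partial multi-byte sequences between chunks
        decoder.decode(b"", final=True)  # fails if the data ends mid-character
    except UnicodeDecodeError:
        return False
    return True


# Usage (hypothetical path): decodes_cleanly(read_chunks("notes.txt"))
```

Note that this only proves the bytes are valid in the chosen encoding. A single-byte encoding such as Latin-1 accepts any byte sequence, so the check is meaningful only for stricter encodings like UTF-8.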
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utilities do this, but check only the first 1K or 2K of the file as an "optimistic optimization".
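A minimal sketch of that prefix check might look like the following. The 2048-byte sample size comes from the figure above, but the test itself (reject the sample if it contains control bytes other than tab, line feed, or carriage return) is an assumption chosen for this example, not a description of what any particular utility does.

```python
# Control bytes that commonly appear in text: tab, line feed, carriage return.
ALLOWED_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def looks_like_text(sample: bytes) -> bool:
    """Return False if the sample holds a NUL or other unexpected control byte."""
    return not any(b < 0x20 and b not in ALLOWED_CONTROLS for b in sample)


def sniff_file(path, sample_size=2048):
    """Read only the first `sample_size` bytes and apply the heuristic."""
    with open(path, "rb") as handle:
        return looks_like_text(handle.read(sample_size))
```

Because only the prefix is read, a file whose binary content starts later will be misclassified, and UTF-16 text (which contains NUL bytes for ASCII characters) will be reported as binary.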
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check that it contains only alphanumeric characters.
So the first thing you have to do is check the file encoding. If it's pure ASCII, you have an easy task: read the whole file into a char array (I'm assuming you are doing this in C/C++ or similar) and check every char in that array with the functions isalpha and isdigit. Of course, you have to take care of special exceptions like the tab '\t', the space ' ', and the newline ('\n' on Linux, '\r\n' on Windows).
With a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric. Also note that for UTF-16 or wider encodings a simple char array is too small, but if you are doing it in C#, for example, you don't have to worry about the size :)
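For the pure-ASCII case, the per-character check could be sketched as below. It is written in Python rather than C, so a set of allowed byte values stands in for isalpha and isdigit; the function name is_plain_ascii_text and the extra parameter are hypothetical additions for this example.

```python
import string

# ASCII letters and digits, plus the whitespace exceptions named above.
ALLOWED = frozenset((string.ascii_letters + string.digits + "\t \n\r").encode("ascii"))


def is_plain_ascii_text(data: bytes, extra: bytes = b"") -> bool:
    """Return True if every byte is alphanumeric, whitespace, or listed in `extra`."""
    allowed = ALLOWED | frozenset(extra)
    return all(byte in allowed for byte in data)
```

Taken literally, the alphanumeric rule rejects ordinary punctuation, so in practice something like extra=string.punctuation.encode("ascii") would probably be needed.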
You can write a function that tries to determine whether a file is text-based. While this will not be 100% accurate, it may be good enough for you. Such a function does not need to go through the whole file; about a kilobyte should be enough (or even less). One thing to do is count how many whitespace characters and newlines there are. Another is to look at individual bytes and check whether they are alphanumeric. With some experimentation you should be able to come up with a decent function. Note that this is just a basic approach, and text encodings might complicate things.
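One way to combine those two counts is a simple ratio, sketched below. The names text_score and probably_text and the 0.7 threshold are assumptions for this example; the threshold in particular would need the experimentation mentioned above.

```python
import string

# Bytes counted as "texty": ASCII letters, digits, and common whitespace.
TEXTY = (string.ascii_letters + string.digits + " \t\r\n").encode("ascii")


def text_score(sample: bytes) -> float:
    """Fraction of bytes in the sample that are alphanumeric or whitespace."""
    if not sample:
        return 1.0  # assumption: an empty file is treated as text
    leftover = sample.translate(None, TEXTY)  # delete every texty byte
    return 1.0 - len(leftover) / len(sample)


def probably_text(sample: bytes, threshold: float = 0.7) -> bool:
    """Apply an illustrative cutoff; tune `threshold` against real files."""
    return text_score(sample) >= threshold


# Usage (hypothetical path): read about a kilobyte and score it.
# with open("notes.txt", "rb") as handle:
#     print(probably_text(handle.read(1024)))
```

Since only ASCII bytes are counted, text in non-Latin scripts scores low with this sketch, which is one of the ways encodings complicate the basic approach.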