How can I find out which special characters are in a file?
My app needs to process text files during a batch process. Occasionally I receive a file with a special character at the end. I am not sure what that character is. Is there any way I can find out what it is, so that I can tell the other team that produces the file?
I have used Mozilla's library to guess the file encoding, and it says UTF-8.
Comments (3)
First, whether the character is really "special" depends on what you call a "special character". As a side note, on Unix and OS X you can use, for example, the od, file and hexdump commands to easily examine files.
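Where those commands are not available, a similar view of the end of a file can be produced with a few lines of Python. This is only a minimal sketch: the function name `hexdump_tail` and the default of 32 bytes are assumptions made for illustration, and the layout merely imitates a typical hex dump.

```python
import os


def hexdump_tail(path, count=32):
    """Return a hex dump of the last `count` bytes of the file at `path`."""
    size = os.path.getsize(path)
    start = max(0, size - count)
    with open(path, "rb") as f:  # binary mode: we want raw bytes, not decoded text
        f.seek(start)
        data = f.read()
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and every other byte as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start + offset:08x}  {hex_part:<47}  |{text_part}|")
    return "\n".join(lines)
```

Calling `print(hexdump_tail("input.txt"))`, where `input.txt` is a placeholder file name, shows the trailing bytes in hexadecimal so the suspicious ones can be read off directly.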
Now, if you know your file encoding is UTF-8, every byte whose highest bit is zero corresponds to exactly one character (for instance, if the last byte of the file is '0a', that single byte is one "character", the line feed).
A file in UTF-8 also means that every byte whose highest bit is set to 1 is part of a multi-byte character.
For example, suppose the only three bytes in a sequence that have their highest bit set are "e2 80 a6" (all the values from 0x80 to 0xFF have their leftmost/highest bit set). They are part of the same character: a non-ASCII character in UTF-8 is never made of only one byte, the lead byte e2 announces a three-byte sequence, and 80 and a6 are its continuation bytes. The fact that every UTF-8 byte whose leftmost/highest bit is set belongs to a multi-byte character is IMHO a truly beautiful feature of UTF-8.
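The high bits of each byte say more than "part of a multi-byte character": they also tell a lead byte from a continuation byte. A small sketch of that classification follows. The function name `classify_utf8_byte` and the sample bytes `61 e2 80 a6 0a` are chosen here for illustration, and the function looks only at bit patterns, so it does not reject values such as c0 or f5 that match a lead pattern but never occur in valid UTF-8.

```python
def classify_utf8_byte(b):
    """Describe the role of a single byte value (0-255) based on its high bits."""
    if b < 0x80:
        return "ASCII (single-byte character)"    # 0xxxxxxx
    if b < 0xC0:
        return "continuation byte"                # 10xxxxxx
    if b < 0xE0:
        return "lead byte of a 2-byte character"  # 110xxxxx
    if b < 0xF0:
        return "lead byte of a 3-byte character"  # 1110xxxx
    if b < 0xF8:
        return "lead byte of a 4-byte character"  # 11110xxx
    return "invalid in UTF-8"                     # 11111xxx


# Illustrative input: the letter "a", the three bytes e2 80 a6, and a line feed.
for b in bytes.fromhex("61 e2 80 a6 0a"):
    print(f"{b:02x} {b:08b} {classify_utf8_byte(b)}")
```

The binary column makes the pattern visible: e2 starts with 1110, so two continuation bytes (each starting with 10) must follow it.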
Now you search Google for "e2 80 a6" and you see that it's the Unicode character named "horizontal ellipsis" (code point U+2026, which UTF-8 encodes as the hexadecimal bytes e2 80 a6).
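The same lookup can be done offline, since Python ships the Unicode character names in its standard `unicodedata` module. A minimal sketch, using the three bytes from the example:

```python
import unicodedata

raw = bytes.fromhex("e2 80 a6")  # the three bytes found in the hex dump
char = raw.decode("utf-8")       # decode them as a single UTF-8 character

# Print the code point and the official Unicode name.
print(f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))
# prints: U+2026 HORIZONTAL ELLIPSIS
```

The second argument to `unicodedata.name` is a fallback for characters that have no name in the database, such as control characters.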
So basically you have to do two things:
find which bytes make up that last "special" character (is it just one byte or several bytes?)
find which "special character" this byte or these bytes correspond to
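Both steps can be sketched together in Python. The helper name `last_character` is hypothetical, and the sketch assumes the data really is UTF-8, as the encoding detector reported; if the trailing bytes do not decode, it says so instead of guessing.

```python
import unicodedata


def last_character(data):
    """Return (raw_bytes, description) for the last UTF-8 character in `data`."""
    if not data:
        raise ValueError("no data")
    # Step 1: walk back over continuation bytes (10xxxxxx) until we reach
    # the byte that starts the last character.
    start = len(data) - 1
    while start > 0 and 0x80 <= data[start] < 0xC0:
        start -= 1
    raw = data[start:]
    # Step 2: decode those bytes and look up the character they represent.
    try:
        char = raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw, "not valid UTF-8"
    # Control characters have no name in unicodedata, hence the fallback.
    return raw, f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"


# Illustrative input: "abc" followed by a horizontal ellipsis.
print(last_character("abc\u2026".encode("utf-8")))
```

For a real file, the bytes would come from `open(path, "rb").read()`; reading in binary mode matters, because the goal is to inspect the raw bytes rather than let Python decode them first.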
Any hex editor ought to allow you to see each individual byte in a file. This ought to allow you to tell them what character it is.
Here's one I've used in the past: http://www.hexworkshop.com/
On Unix, you can use the od utility to output several representations of the byte data in a file or stream.