
How can I find out which special characters are in a file?

Posted on 2024-08-30 09:14:56

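My application needs to process text files as part of a batch process. Sometimes I receive a file with a special character at the end, and I'm not sure what that character is. Is there any way to find out what it is, so that I can tell the other team that generates the file?

I have used Mozilla's library to guess the file encoding, and it reports UTF-8.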


谢绝鈎搭 2024-09-06 09:14:56

First, whether the character is really "special" depends on what you call a "special character". As a side note, on Unix and OS X you can use, for example, the od, file and hexdump commands to easily examine files:

... $  hexdump -C example.txt 
00000530  6f 77 73 20 61 63 74 69  6f 6e 2e 0a 0a 0a 0a     |ows action.....|
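If hexdump is not available (on Windows, say), roughly the same view of the end of a file can be produced with a few lines of Python. This is only a sketch: the function name hex_tail and the default of 16 bytes are my own choices for illustration, not part of any tool.

```python
def hex_tail(data: bytes, count: int = 16) -> str:
    """Format the last `count` bytes (count must be positive) like one hexdump -C line."""
    tail = data[-count:]
    offset = len(data) - len(tail)  # position of the first byte shown
    hex_part = " ".join(f"{b:02x}" for b in tail)
    # Printable ASCII is shown as-is, everything else as a dot.
    text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in tail)
    return f"{offset:08x}  {hex_part}  |{text_part}|"


print(hex_tail(b"ows action.\n\n\n\n", 4))
```

For a real file you would pass its contents, for example open("example.txt", "rb").read().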

Now, if you know your file encoding is UTF-8, every byte whose highest bit is set to zero corresponds to exactly one character (in the example above, the last byte is '0a', so that single byte is one "character", here a line feed).

UTF-8 encoding also means that every byte whose highest bit is set to 1 is part of a multi-byte character. For example, in the following byte sequence:

75 20 5b e2 80 a6 5d 20  61 75 74 6f 72 69 73 61

the only three bytes that have their highest bit set are "e2 80 a6" (all the values from 0x80 to 0xFF have their leftmost/highest bit set) and they're part of the same character. You cannot have a non-ASCII character in UTF-8 made of only one byte whose highest bit is set, and these three bytes are adjacent with plain ASCII bytes on both sides, hence you know that they belong to the same character. The fact that every UTF-8 byte whose leftmost/highest bit is set is part of a multi-byte character is IMHO a truly beautiful feature of UTF-8.
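The grouping described above can be done mechanically, because the lead byte of a UTF-8 sequence says how many bytes the character occupies. Here is a minimal sketch, assuming the input is valid UTF-8; the helper names are made up for this example.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the character starting with `lead` occupies."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"byte 0x{lead:02x} cannot start a UTF-8 sequence")


def multibyte_sequences(data: bytes):
    """Return (offset, bytes) for every multi-byte character in `data`."""
    found = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n > 1:
            found.append((i, data[i:i + n]))
        i += n
    return found


sample = bytes.fromhex("75 20 5b e2 80 a6 5d 20  61 75 74 6f 72 69 73 61")
print(multibyte_sequences(sample))
```

On the sample bytes this reports a single three-byte sequence at offset 3. A stray continuation byte (10xxxxxx) in lead position raises ValueError, which is itself a hint that the file is not clean UTF-8.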

Now you Google "e2 80 a6" and you see that it's the Unicode character named "horizontal ellipsis" (code point U+2026, which UTF-8 encodes as the three bytes e2 80 a6).
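Instead of searching the web, the lookup can also be done locally with Python's standard unicodedata module. A small sketch follows; the function name identify is just for illustration.

```python
import unicodedata


def identify(hex_bytes: str):
    """Decode hex-encoded UTF-8 bytes for one character; return (code point, name)."""
    char = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(char) != 1:
        raise ValueError("expected the bytes of exactly one character")
    # Some characters (e.g. control characters) have no name, hence the default.
    return f"U+{ord(char):04X}", unicodedata.name(char, "<no name>")


print(identify("e2 80 a6"))
```

For the bytes above this gives U+2026 and HORIZONTAL ELLIPSIS.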

So basically you have to do two things:

find which bytes make up that last "special" character (is it just one byte or several bytes?)
find which "special character" this byte or these bytes correspond to
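Both steps can be combined so that the report for the other team comes straight from the file contents. The following is a sketch under the assumption that the whole file really is valid UTF-8 (decoding raises UnicodeDecodeError otherwise); describe_tail and its count parameter are hypothetical names chosen for this example.

```python
import unicodedata


def describe_tail(data: bytes, count: int = 5):
    """List (code point, UTF-8 bytes in hex, name) for the last `count` characters."""
    text = data.decode("utf-8")
    rows = []
    for ch in text[-count:]:
        rows.append((
            f"U+{ord(ch):04X}",
            ch.encode("utf-8").hex(" "),            # the bytes as they appear in a hex dump
            unicodedata.name(ch, "<no name>"),      # control characters have no name
        ))
    return rows


for row in describe_tail("action.\n\u2026".encode("utf-8"), 3):
    print(row)
```

For a real file, pass its raw contents, for example pathlib.Path("example.txt").read_bytes(). Reading in binary mode matters here, since text mode could silently translate or hide the very bytes you are looking for.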
