Check whether all files are encoded as UTF-8

Posted 2024-08-11 07:37:31 · 95 words · 13 views · 0 comments

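Does anyone know of a Windows application that can scan a directory and check which scripts are or aren't encoded in a specified character set (UTF-8 in this case)? I could do it manually, but that would take a while and is error-prone!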

等待圉鍢 2024-08-18 07:37:31

UTF-8 isn't a character set; it's an encoding for Unicode characters. And since this isn't programming-related, I'm nudging it over to Super User.

If you do want to write a program that detects invalid UTF-8 sequences, it's pretty easy:

Illegal UTF-8 initial sequences

UTF-8 Sequence       Reason for Illegality 
10xxxxxx             illegal as initial byte of character (80..BF) 
1100000x             illegal, overlong (C0..C1 80..BF) 
11100000  100xxxxx   illegal, overlong (E0 80..9F) 
11110000  1000xxxx   illegal, overlong (F0 80..8F) 
11111000  10000xxx   illegal, overlong (F8 80..87) 
11111100  100000xx   illegal, overlong (FC 80..83) 
1111111x             illegal; prohibited by spec

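Then, provided the first octet is legal, just remember that the number of octets forming a code point can be obtained by counting the 1 bits before the first 0 bit in that octet. (A first octet of the form 0xxxxxxx has no leading 1 bits and is a single-octet ASCII character.) Note that current UTF-8 (RFC 3629) stops at 4 octets and at U+10FFFF, so the lead octets F5..FF are illegal in any case, not only in the overlong 5- and 6-octet forms listed above.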

For example, 11110xxx is the start of a 4-octet sequence so you should skip ahead 4 octets once you've established its legality.
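As a rough illustration of the counting rule, here is a minimal Python sketch. The function name `sequence_length` is made up for this example, and it only reads the length off the lead octet; it does not apply the overlong checks from the table.

```python
def sequence_length(lead: int) -> int:
    """Return the total number of octets announced by a lead octet.

    Returns 0 when the octet cannot start a sequence: a stray
    continuation octet (10xxxxxx) or a 5+ octet form (F8..FF).
    Hypothetical helper; overlong checks are deliberately left out.
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx: single-octet ASCII character
    ones = 0
    mask = 0x80
    # Count the 1 bits before the first 0 bit, scanning from the top bit.
    while lead & mask:
        ones += 1
        mask >>= 1
    # 2, 3, or 4 leading ones mean a 2-, 3-, or 4-octet sequence.
    return ones if 2 <= ones <= 4 else 0
```

So `sequence_length(0xF0)` gives 4, and the scanner would move 4 octets forward (lead octet included) after validating them.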

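The other thing to do is ensure that all continuation octets start with the bits 10, that is, fall in the range 80..BF. A file that ends in the middle of a sequence is invalid too.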
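Putting the checks together, a validator plus a directory scan might look roughly like the sketch below. The names `is_valid_utf8` and `find_non_utf8` are invented for this example. The surrogate (ED A0..BF) and above-U+10FFFF (F4 90..BF) checks come from RFC 3629 rather than from the table above, and the lead octet ranges are written out directly instead of counting bits.

```python
import os


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (sketch, RFC 3629 rules)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            i += 1  # single-octet ASCII
            continue
        # Legal lead octets only; this rejects 80..BF (stray continuation),
        # C0..C1 (overlong 2-octet forms), and F5..FF.
        if 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            return False
        if i + length > n:
            return False  # sequence is cut off at the end of the data
        tail = data[i + 1:i + length]
        # Every continuation octet must look like 10xxxxxx.
        if any((b & 0xC0) != 0x80 for b in tail):
            return False
        second = tail[0]
        if lead == 0xE0 and second < 0xA0:
            return False  # overlong 3-octet form (E0 80..9F)
        if lead == 0xF0 and second < 0x90:
            return False  # overlong 4-octet form (F0 80..8F)
        if lead == 0xED and second > 0x9F:
            return False  # UTF-16 surrogates (assumption: RFC 3629 rule)
        if lead == 0xF4 and second > 0x8F:
            return False  # beyond U+10FFFF (assumption: RFC 3629 rule)
        i += length
    return True


def find_non_utf8(root: str) -> list:
    """Walk root and return paths of files that are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if not is_valid_utf8(f.read()):
                    bad.append(path)
    return bad
```

In practice, Python's built-in strict decoder performs the same validation, so `data.decode("utf-8")` inside a `try`/`except UnicodeDecodeError` is a shorter alternative to the hand-written loop. Keep in mind that a pure-ASCII file also passes, since ASCII is a subset of UTF-8.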
