Check whether all files are encoded as UTF-8

Posted 2024-08-11 07:37:31 · 95 words · 10 views · 0 comments

Does anyone know of a Windows app that can scan through a directory and check which scripts are/aren't encoded as a specified charset (UTF-8 in this case)? I could do it manually, but that could take a while and is quite error prone!

Comments (2)

等待圉鍢 2024-08-18 07:37:31

UTF-8 isn't a character set, it's an encoding for Unicode characters. And, since this is not programming related, I'm nudging it over to superuser.

If you do want to write a program for detecting those sequences, it's pretty easy:

Illegal UTF-8 initial sequences

UTF-8 Sequence       Reason for Illegality
10xxxxxx             illegal as initial byte of character (80..BF)
1100000x             illegal, overlong (C0..C1 80..BF)
11100000  100xxxxx   illegal, overlong (E0 80..9F)
11110000  1000xxxx   illegal, overlong (F0 80..8F)
11111000  10000xxx   illegal, overlong (F8 80..87)
11111100  100000xx   illegal, overlong (FC 80..83)
1111111x             illegal; prohibited by spec (FE..FF)

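Then, provided the first octet is legal, just remember that the number of octets forming a code point can be obtained by counting the number of 1 bits before the first 0 bit. A first octet of the form 0xxxxxxx has no leading 1 bits and is a complete single-octet character on its own.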

For example, 11110xxx is the start of a 4-octet sequence so you should skip ahead 4 octets once you've established its legality.

The other thing to do is ensure that all continuation octets start with 10.
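The checks described above can be sketched in Python roughly as follows. This is a minimal illustration, not a complete validator: the names `leading_ones`, `first_invalid_offset` and `OVERLONG` are made up for this example, and the code follows the table in this answer only, so it still accepts the old 5- and 6-octet forms and does not reject surrogates or code points above U+10FFFF.

```python
# Hypothetical helper names; the rules mirror the table in this answer only.

# Lead octet -> smallest second octet that is NOT overlong for that lead.
OVERLONG = {0xE0: 0xA0, 0xF0: 0x90, 0xF8: 0x88, 0xFC: 0x84}


def leading_ones(octet: int) -> int:
    """Count the 1 bits before the first 0 bit of an octet."""
    count = 0
    mask = 0x80
    while mask and octet & mask:
        count += 1
        mask >>= 1
    return count


def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first illegal sequence, or -1 if none is found."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        length = leading_ones(lead)
        if length == 0:
            # 0xxxxxxx: a single-octet character
            i += 1
            continue
        if length == 1:
            # 10xxxxxx is a continuation octet and cannot start a character
            return i
        if length >= 7:
            # 1111111x is prohibited
            return i
        if lead in (0xC0, 0xC1):
            # 1100000x: overlong two-octet form
            return i
        if i + length > n:
            # The sequence is cut off by the end of the data
            return i
        tail = data[i + 1:i + length]
        if any(octet & 0xC0 != 0x80 for octet in tail):
            # Every continuation octet must start with the bits 10
            return i
        if lead in OVERLONG and tail[0] < OVERLONG[lead]:
            # Overlong three- to six-octet forms
            return i
        i += length
    return -1
```

Calling `first_invalid_offset(open(name, "rb").read())` on a file gives -1 when nothing illegal was found. If the stricter current rules are wanted, Python's built-in `data.decode("utf-8")` raises `UnicodeDecodeError` on invalid input and may be the simpler choice.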

羁绊已千年 2024-08-18 07:37:31

Not sure if this is what you're looking for, but I use a command shell for-loop and dump the first few bytes of each file using my hdump utility, which displays the bytes of the file in hexadecimal form. I then look for the leading 3-byte UTF-8 signature (Byte Order Mark) at the start of each file.

My hdump utility is available at: http://david.tribble.com/programs.html
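For anyone without hdump, the same signature check can be sketched in Python. This is an illustration under a few assumptions, not the method used above: the function names `classify_file` and `scan_directory` and the result labels are invented for this example. Because the UTF-8 signature (EF BB BF) is optional, a file without it can still be valid UTF-8, so the sketch also tries a strict decode of the whole file.

```python
import codecs
from pathlib import Path


def classify_file(path: Path) -> str:
    """Label one file by its UTF-8 signature and by whether it decodes as UTF-8."""
    data = path.read_bytes()
    # codecs.BOM_UTF8 is the 3-byte signature EF BB BF
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "not UTF-8"
    return "UTF-8 with BOM" if has_bom else "UTF-8 without BOM"


def scan_directory(root: str, pattern: str = "*") -> dict:
    """Walk root recursively and classify every file whose name matches pattern."""
    results = {}
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            results[str(path)] = classify_file(path)
    return results
```

For example, `scan_directory("C:/scripts", "*.php")` (a placeholder path and pattern) returns a dictionary mapping each file to one of the three labels. Note that a pure ASCII file is reported as "UTF-8 without BOM", since ASCII is a subset of UTF-8.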
