Determining whether a text file without a BOM is UTF-8 or ASCII

Posted on 2024-10-14 13:49:33

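Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist's name contains Asian characters, the output is UTF-8.
+ If it contains only ASCII characters, the output is ASCII.

The output does not start with any BOM.

The problem is that if the artist's name contains, for example, an "ä", the output is ASCII, but not US-ASCII, so the "ä" is not valid UTF-8 and is skipped.

How can I tell whether the text file that ffmpeg outputs is UTF-8 or not? The application has no switch for this, and I just think it's pretty dumb not to always use UTF-8. :/

Something like this would be perfect:

http://linux.die.net/man/1/isutf8

Does anyone know of a Windows version?

Thanks a lot in advance, guys!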


梦途 2024-10-21 13:49:33

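This program/source code might help you:

Detecting incoming and outgoing encodings

Detect the encoding of a text without a BOM (byte order mark) and choose the best encoding...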


稍尽春風 2024-10-21 13:49:33

You say that "ä" is not valid UTF-8. That is not correct.
It seems you don't have a clear picture of what UTF-8 is. UTF-8 is a system for encoding Unicode code points. Validity is not a property of the character itself; it is a question of how the character has been encoded. For example, "ä" stored as the single byte 0xE4 (as in ISO-8859-1 or Windows-1252) is not a valid UTF-8 sequence, while the same character stored as the two bytes 0xC3 0xA4 is.
There are several systems that can encode Unicode code points; UTF-8 is one and UTF-16 is another. "ä" is perfectly legal in UTF-8. In fact, every character is valid as long as it has a Unicode code point.

However, ASCII has only 128 valid values, which map identically to the first 128 code points in Unicode. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system, e.g. UTF-8.

Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, the data in an ASCII file is byte-for-byte identical to a file with the same data that you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8. The two are indistinguishable for data in the ASCII range (i.e. 128 characters).
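
To make the distinction between a character and its encoding concrete, here is a small Python sketch. It is only an illustration: the choice of Latin-1 as the "other" single-byte encoding is an assumption, since the thread never says which encoding ffmpeg actually produced.

```python
# "ä" is code point U+00E4. The character is the same in every case;
# only the bytes that represent it differ between encodings.
char = "ä"

utf8_bytes = char.encode("utf-8")      # two bytes: C3 A4
latin1_bytes = char.encode("latin-1")  # one byte: E4 (assumed legacy encoding)

print(utf8_bytes.hex())    # c3a4
print(latin1_bytes.hex())  # e4

# The Latin-1 byte on its own is not a well-formed UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)  # False

# Pure ASCII text yields exactly the same bytes in ASCII and in UTF-8.
ascii_text = "Artist"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```

So a file containing only ASCII bytes cannot be told apart from a UTF-8 file, and it does not need to be.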

You can check a file for 7-bit ASCII compliance.

# If nothing is output to stdout, the file is 7-bit ASCII compliant 
# Output lines containing ERROR chars -- to stdout

  perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
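
Since the question asks for something that works on Windows, the same check might be sketched in Python, which runs there without a Unix shell. The function name `non_ascii_lines` is made up for this example; like the one-liner above, it reports the lines that contain a byte outside 0x00-0x7F.

```python
from typing import List


def non_ascii_lines(data: bytes) -> List[bytes]:
    """Return every line that contains at least one byte above 0x7F."""
    return [line for line in data.splitlines()
            if any(byte > 0x7F for byte in line)]


def is_7bit_ascii(data: bytes) -> bool:
    """True if no byte in the data has its high bit set."""
    return not non_ascii_lines(data)


# Sample input: the second line holds a Latin-1 encoded "ä" (byte E4).
sample = b"artist: Abba\nartist: M\xe4rz\n"
print(non_ascii_lines(sample))
print(is_7bit_ascii(sample))
```

For a real file, read it in binary mode first, e.g. `open(path, "rb").read()`, so that Python does not try to decode it for you.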

Here is a similar check for UTF-8 compliance.

perl -l -ne '/
   ^( ([\x00-\x7F])              # 1-byte pattern
     |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
     |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
     |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
    )*$ /x or print' "$1"
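
A rough Python counterpart to both checks, again only a sketch: it relies on Python's built-in strict UTF-8 decoder instead of a hand-written byte pattern, and the function name `classify` and its three labels are invented for this example. Anything that is neither ASCII nor well-formed UTF-8 is simply reported as "unknown"; guessing which legacy encoding it is would need a separate heuristic.

```python
def classify(data: bytes) -> str:
    """Label raw bytes as 'ascii', 'utf-8' or 'unknown'."""
    if all(byte < 0x80 for byte in data):
        # Pure 7-bit data is valid ASCII and valid UTF-8 at the same time.
        return "ascii"
    try:
        # The strict decoder rejects overlong forms, encoded surrogates
        # and values above U+10FFFF.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


def classify_file(path: str) -> str:
    """Classify a file's contents; the file is read as raw bytes."""
    with open(path, "rb") as handle:
        return classify(handle.read())


print(classify(b"Artist"))             # ascii
print(classify("ä".encode("utf-8")))   # utf-8
print(classify(b"\xe4"))               # unknown
```

Unlike the line-by-line one-liner, this gives a single verdict for the whole input, which is closer to what a tool like isutf8 reports.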
