Determining whether a text file without a BOM is UTF-8 or ASCII

Posted on 2024-10-14 13:49:33

Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist's name contains Asian characters, the output is UTF-8.
+ If it contains only ASCII characters, the output is ASCII.

The output does not begin with a BOM.

The problem is that if the artist's name contains, for example, an "ä", the output is apparently in some 8-bit "extended ASCII" encoding rather than US-ASCII, so the "ä" is not valid UTF-8 and is skipped.

How can I tell whether the output text file from ffmpeg is UTF-8? The application does not have any switches for this, and I think it's plain dumb not to always go with UTF-8. :/

Something like this would be perfect:

http://linux.die.net/man/1/isutf8

Does anyone know of a Windows version?

Thanks a lot in advance, guys!

Comments (2)

梦途 2024-10-21 13:49:33

This program/source might help you:

Detect the encoding of a text without BOM (Byte Order Mark) and choose the best Encoding ...
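The linked program is not reproduced here, so the following is only a minimal Python sketch of the general idea, not that program's method: try strict UTF-8 first and fall back to a single-byte encoding when decoding fails. The function name `decode_best_effort` and the choice of `cp1252` as the fallback are assumptions made for illustration; the encoding ffmpeg actually emits on a given system may differ.

```python
def decode_best_effort(data: bytes, fallback: str = "cp1252") -> tuple:
    """Decode data as UTF-8 if it is well-formed, else with the fallback.

    Returns (text, encoding_used). The fallback encoding is an assumption,
    not something that can be detected reliably from the bytes alone.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return data.decode(fallback, errors="replace"), fallback


# Hypothetical artist names: one stored as Latin-1/cp1252, one as UTF-8.
legacy = b"Bj\xf6rk"
modern = "Bj\u00f6rk".encode("utf-8")

print(decode_best_effort(legacy)[1])
print(decode_best_effort(modern)[1])
```

Trying UTF-8 first is the safer order, because text in a single-byte encoding that contains non-ASCII characters only rarely happens to form valid UTF-8 sequences.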

稍尽春風 2024-10-21 13:49:33

You say that "ä" is not valid UTF-8. That is not correct.
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system for encoding Unicode code points. Validity is not a property of the character itself; it is a question of how the character has been encoded. The single byte 0xE4 (the "ä" of ISO-8859-1 or Windows-1252) is not a valid UTF-8 sequence on its own, whereas UTF-8 encodes "ä" (U+00E4) as the two bytes 0xC3 0xA4.
There are many systems that can encode Unicode code points; UTF-8 is one and UTF-16 is another. "ä" is perfectly legal in UTF-8. In fact, every character is valid, as long as it has a Unicode code point.

However, ASCII has only 128 valid values, which correspond exactly to the first 128 Unicode code points. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system, e.g. UTF-8.

Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 represents each of these 128 values in a single byte, just as ASCII does, the data in an ASCII file is byte-for-byte identical to a file with the same data that you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8. The two are indistinguishable for data in the ASCII range (i.e., 128 characters).
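To make the distinction between a character and its encoding concrete, here is a small Python sketch. The helper name `decodes_as_utf8` is made up for illustration; it relies only on the standard strict UTF-8 decoder.

```python
# U+00E4 is the character "ä". The escape keeps this source file pure ASCII.
utf8_bytes = "\u00e4".encode("utf-8")      # two bytes: 0xC3 0xA4
latin1_bytes = "\u00e4".encode("latin-1")  # one byte: 0xE4


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))    # the UTF-8 form of the character
print(decodes_as_utf8(latin1_bytes))  # a lone 0xE4 lead byte with no continuation bytes

# Pure ASCII text yields identical bytes under both encodings.
print("Artist".encode("ascii") == "Artist".encode("utf-8"))
```

The same character is acceptable in one byte form and rejected in the other, which is why the check has to be done on the bytes rather than on the characters.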

You can check a file for 7-bit ASCII compliance:

# Prints every line that contains a non-ASCII byte to stdout.
# If nothing is printed, the file is 7-bit ASCII compliant.
# "$1" is the file to check (the first argument of the enclosing shell script).

  perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
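For a Windows machine without Perl, roughly the same check can be sketched in Python. This is an illustrative equivalent rather than a port of the one-liner; the function name `non_ascii_lines` and the sample data are made up, and `bytes.isascii()` requires Python 3.7 or later.

```python
def non_ascii_lines(data: bytes) -> list:
    """Return every line of data that contains a byte above 0x7F."""
    return [line for line in data.splitlines() if not line.isascii()]


# Hypothetical sample: the second line holds 0xF6 ("ö" in Latin-1).
sample = b"Artist: Bjork\nArtist: Bj\xf6rk\n"
offending = non_ascii_lines(sample)
print(offending if offending else "7-bit ASCII compliant")
```

For a real file, read it in binary mode (for example with `open(path, "rb").read()`) so that Python does not try to decode the bytes before they are checked.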

Here is a similar check for UTF-8 compliance. It prints every line that is not a well-formed UTF-8 byte sequence:

perl -l -ne '/
   ^( ([\x00-\x7F])              # 1-byte pattern
     |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
     |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
     |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
    )*$ /x or print' "$1"
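As a cross-platform counterpart to `isutf8`, the two checks can be combined into one small Python sketch that labels a byte string as ASCII, UTF-8, or neither. The function name `classify` and the label strings are assumptions made for illustration. Python's strict UTF-8 decoder rejects overlong forms, encoded surrogates, and values above U+10FFFF, which are the same cases the byte ranges in the regex exclude.

```python
def classify(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'unknown-8bit'."""
    if data.isascii():          # every byte is in 0x00-0x7F
        return "ascii"
    try:
        data.decode("utf-8")    # strict mode raises on any malformed sequence
    except UnicodeDecodeError:
        return "unknown-8bit"   # e.g. Latin-1 or Windows-1252 output
    return "utf-8"


print(classify(b"Artist"))                      # only ASCII bytes
print(classify("\u00e4".encode("utf-8")))       # 0xC3 0xA4
print(classify(b"\xe4"))                        # lone Latin-1 byte
print(classify(b"\xc0\xaf"))                    # overlong encoding of "/"
```

Applied to the ffmpeg output read in binary mode, a result of `unknown-8bit` means the text has to be decoded with some other encoding. Which one cannot be determined from the bytes alone.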