Determining whether a text file without a BOM is UTF8 or ASCII
Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist has Asian characters in its name, the output is UTF8.
+ If it just has ASCII characters the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is ASCII, just not US-ASCII, so the "ä" is not valid UTF8 and is skipped.
How can I tell whether the output text file from ffmpeg is UTF8 or not? The application does not have any switches for this, and I just think it's plain dumb not to always go with UTF8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
Does anyone know of a Windows version?
Thanks a lot in advance, guys!
Comments (2)
This program/source might help you:
You say "ä" is not valid UTF-8. This is not correct.
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system for encoding Unicode code points. Validity is not a property of the character itself; it is a question of how the character has been encoded.
There are many systems which can encode Unicode code points; UTF-8 is one and UTF-16 is another.
"ä" is quite legal in UTF-8. Actually all characters are valid, so long as the character has a Unicode code point. However, ASCII has only 128 valid values, which are identical to the first 128 characters in the Unicode code point system. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system, e.g. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, the data in an ASCII file is identical to a file with the same data but which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8. They are indistinguishable for data in the ASCII range (i.e., 128 characters).
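To make the point about encodings concrete, here is a small Python sketch (not from the original answer). The artist name `Bjork` is a made-up example, and Latin-1 is only an assumption about what a one-byte "ä" in the output might be; the file could just as well be in some other single-byte code page.

```python
# Pure ASCII text produces exactly the same bytes whether you call it
# ASCII or UTF-8, so the two cannot be told apart for such data.
name = "Bjork"  # hypothetical artist name, ASCII only
assert name.encode("ascii") == name.encode("utf-8")

# "ä" is the code point U+00E4. The bytes depend on the encoding chosen.
umlaut = "\u00e4"
utf8_bytes = umlaut.encode("utf-8")      # two bytes: 0xC3 0xA4
latin1_bytes = umlaut.encode("latin-1")  # one byte:  0xE4 (assumed legacy encoding)

print(utf8_bytes.hex())    # c3a4
print(latin1_bytes.hex())  # e4

# In UTF-8, 0xE4 is the lead byte of a three-byte sequence. On its own it is
# an incomplete sequence, which is why a UTF-8 reader rejects or skips it.
```

So the character is fine in UTF-8; what is not valid UTF-8 is the single byte 0xE4 written by a different encoding.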
You can check a file for 7-bit ASCII compliance.
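The command originally posted here is not preserved, so the following is only a minimal Python sketch of such a check, not the answerer's method. The function names `is_7bit_ascii` and `file_is_7bit_ascii` are made up for illustration.

```python
def is_7bit_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit range 0x00-0x7F."""
    return all(byte < 0x80 for byte in data)


def file_is_7bit_ascii(path: str) -> bool:
    """Read a file in binary mode and apply the same check."""
    with open(path, "rb") as handle:
        return is_7bit_ascii(handle.read())
```

Reading in binary mode matters: opening the file as text would already force a decoding choice, which is the very thing being tested.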
Here is a similar check for UTF-8 compliance.
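The original command is likewise missing, so this is a hedged Python sketch of what an `isutf8`-style check could look like on Windows or anywhere else Python runs. The name `classify_bytes` and the three labels are assumptions made for this example. The result "other" only means "not valid UTF-8"; it does not say which legacy encoding was used.

```python
def classify_bytes(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'other'.

    'ascii' : only 7-bit bytes (also valid UTF-8, since ASCII is a subset)
    'utf-8' : has bytes >= 0x80 and decodes cleanly as UTF-8
    'other' : not valid UTF-8, e.g. a single-byte legacy code page
    """
    if all(byte < 0x80 for byte in data):
        return "ascii"
    try:
        # Strict decoding raises on any malformed or truncated sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


def classify_file(path: str) -> str:
    """Apply classify_bytes to the raw contents of a file."""
    with open(path, "rb") as handle:
        return classify_bytes(handle.read())
```

For the ffmpeg case in the question, a file classified as "other" would then need to be decoded with whatever code page the system uses. Note that a validity check like this is a heuristic: text in a legacy encoding can, in rare cases, happen to form valid UTF-8 sequences.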