Mutagen and id3 tags - character encoding confusion
I've run into a problem when reading some id3 tags with Icelandic letters.
A quick example from the shell.
>>> audio = mutagen.easyid3.EasyID3('./Björk/Albums/1990 - Gling-Gló [mp3-231]/01 - Gling-Gló.mp3')
>>> audio['title']
5: [u'Gling-Gl\xf3']
First of all, I'm not really sure how to check which character encoding the tags are in. From what I've gathered, this is the way to do it with mutagen:
>>> audio = mutagen.id3.ID3('./Björk/Albums/1990 - Gling-Gló [mp3-231]/01 - Gling-Gló.mp3')
>>> for key, value in audio.items():
...     print value.encoding
This outputs '0' for each item.
And I saw somewhere that for id3 tags, the number 0 meant the string is iso-8859-1 encoded, but I don't know where to go from there. I guess this isn't right?
>>> audio.get('artist')[0].decode('iso-8859-1')
14: u'Bj\xc3\xb6rk'
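For reference, the ID3v2.4 frame format defines four values for the text-encoding byte, and 0 does mean ISO-8859-1. The sketch below maps each value to the matching Python codec name. It is written for Python 3 (the session above is Python 2), and `ID3_TEXT_ENCODINGS` and `decode_id3_text` are hypothetical names for illustration, not part of mutagen. As far as I can tell, mutagen already performs this decoding when it reads a frame, so the strings it returns should not need to be decoded again.

```python
# Text-encoding byte values from the ID3v2.4 frame format.
# (ID3v2.3 only defines values 0 and 1.)
ID3_TEXT_ENCODINGS = {
    0: "iso-8859-1",
    1: "utf-16",      # UTF-16 with a byte order mark
    2: "utf-16-be",   # UTF-16 big-endian, no byte order mark
    3: "utf-8",
}


def decode_id3_text(encoding_byte, raw):
    """Hypothetical helper: turn the raw bytes of a text frame into text."""
    return raw.decode(ID3_TEXT_ENCODINGS[encoding_byte])


# With encoding byte 0, the single byte 0xF3 is the letter o with an acute accent.
title = decode_id3_text(0, b"Gling-Gl\xf3")
assert title == "Gling-Gl\u00f3"
```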
As you can probably tell, I am seriously confused when it comes to character encoding issues.
All I want is to capture the tags as proper utf-8 strings so I can put them in my database.
This is just one example, though. I'll probably run into other files with completely different encodings, so I'm looking for a good all-around solution. Just fixing this case would really help get me on the right track.
Thanks in advance.
Comments (2)
Welcome to the fun world of encoding.
In this step:
...you end up with a unicode string, not a byte string. In the second line, Python is printing the repr of that string, which escapes non-ASCII characters, and that is why you see the hex values. What you need is for Python to take that unicode string and encode it using one of the available character encodings. This was a source of confusion for me too. Just remember: you decode bytes into characters, and you encode characters into bytes.
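To make the direction of encode and decode concrete, here is a minimal sketch. It assumes Python 3, where `str` is Unicode text and `bytes` is the encoded form (the question uses Python 2, where the same roles are played by `unicode` and `str`). The last part shows what happens when bytes are decoded with the wrong codec, which produces a result of the same shape as the `u'Bj\xc3\xb6rk'` in the question.

```python
text = "Bj\u00f6rk"                     # Unicode text: B, j, o-umlaut, r, k

# Encoding goes from text to bytes.
data = text.encode("utf-8")
assert data == b"Bj\xc3\xb6rk"          # the umlaut becomes two bytes, C3 B6

# Decoding goes from bytes back to text; the same codec round-trips.
assert data.decode("utf-8") == text

# Decoding with the wrong codec does not fail, it silently garbles:
# each of the two bytes is read as its own ISO-8859-1 character.
garbled = data.decode("iso-8859-1")
assert garbled == "Bj\u00c3\u00b6rk"

# As long as nothing was lost, the mistake can be undone by reversing it.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
assert repaired == text
```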
So, if you do this:
Well, that's annoying. You told it to encode in UTF-8 but you still see escape sequences. The trick is that evaluating an expression at the interactive prompt just outputs the repr of the result, which escapes any non-ASCII bytes. If you change it to:
...you see the correct result. So, once you actually do something with the newly encoded text, you'll see it represented the way you want. Printing it to the console, writing it to a file, or displaying it in a GUI widget should look fine.
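The difference between the escaped view and the printed text can be sketched as follows. This assumes Python 3 and a terminal that can display UTF-8; the variable names are illustrative only. `ascii()` is used here because it gives the same kind of escaped output that the Python 2 prompt shows for unicode strings.

```python
title = "Gling-Gl\u00f3"            # the Unicode text read from the tag

# Encode once, at the point where bytes are needed (for example, storage).
encoded = title.encode("utf-8")

# Debugging views: non-ASCII content is shown as escape sequences.
print(repr(encoded))                # b'Gling-Gl\xc3\xb3'
print(ascii(title))                 # 'Gling-Gl\xf3'

# Printing the text itself shows the actual characters.
print(encoded.decode("utf-8"))
```

The escapes are only a display convention; the underlying data is the same in all three cases.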
This is working for me