How to check if a string is unicode or ascii?
What do I have to do in Python to figure out which encoding a string has?
In Python 3, all strings are sequences of Unicode characters. There is a `bytes` type that holds raw bytes.

In Python 2, a string may be of type `str` or of type `unicode`. You can tell which with an `isinstance` check against each type.

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
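For illustration, here is a minimal sketch of that kind of type check, written for Python 3, where the two types involved are `str` and `bytes`. The helper name `describe_string_type` is made up for this example and is not part of any library.

```python
def describe_string_type(value):
    """Return a label for the Python type of value, not for its encoding."""
    if isinstance(value, str):
        return "text (str)"
    if isinstance(value, (bytes, bytearray)):
        return "raw bytes"
    return "not a string"


print(describe_string_type("abc"))   # text made only of ASCII-range characters
print(describe_string_type(b"abc"))  # bytes that happen to be ASCII
print(describe_string_type(42))
```

Note that `"abc"` and `b"abc"` get different labels even though both are "ASCII" in the everyday sense, which is exactly the point made above.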
How to tell if an object is a unicode string or a byte string

You can use `type` or `isinstance`.

In Python 2, `str` is just a sequence of bytes. Python doesn't know what its encoding is. The `unicode` type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3, `str` is like Python 2's `unicode`, and is used to store text. What was called `str` in Python 2 is called `bytes` in Python 3.

How to tell if a byte string is valid utf-8 or ascii

You can call `decode`. If it raises a UnicodeDecodeError exception, it wasn't valid.
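The `decode` test can be sketched roughly like this for Python 3. The function name `is_valid_encoding` and the sample byte strings are my own choices for the example.

```python
def is_valid_encoding(data, encoding):
    """Return True if the byte string decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


ascii_bytes = b"hello"
utf8_bytes = "Sh\u014dgun".encode("utf-8")  # contains a two-byte UTF-8 sequence
invalid_bytes = b"\xff\xfe\xfa"             # 0xff can never start a UTF-8 sequence

print(is_valid_encoding(ascii_bytes, "ascii"))    # True
print(is_valid_encoding(utf8_bytes, "ascii"))     # False
print(is_valid_encoding(utf8_bytes, "utf-8"))     # True
print(is_valid_encoding(invalid_bytes, "utf-8"))  # False
```

A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the one the data was written in.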
In Python 3.x, all strings are sequences of Unicode characters, and doing the `isinstance` check for `str` (which means unicode string by default) should suffice.

With regard to Python 2.x, most people seem to be using an if statement that has two checks: one for `str` and one for `unicode`.

If you want to check whether you have a 'string-like' object with a single statement, though, you can test against `basestring`, the common base class of both in Python 2.
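A single-statement check along those lines might look like the sketch below. It assumes that `basestring` exists only on Python 2 and falls back to `str` on Python 3; the names `string_types` and `is_string_like` are hypothetical.

```python
try:
    # Python 2: basestring is the common base class of str and unicode
    string_types = (basestring,)
except NameError:
    # Python 3: basestring no longer exists, str is the only text type
    string_types = (str,)


def is_string_like(obj):
    """One isinstance call covers every text type of the running interpreter."""
    return isinstance(obj, string_types)


print(is_string_like("abc"))
print(is_string_like(123))
```

On Python 3 this deliberately treats `bytes` as not string-like; add `bytes` to the tuple if your code should accept it.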
Unicode is not an encoding, a point Kumar McMillan makes well.

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.
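To make the "not an encoding" point concrete, here is a small sketch of my own (not taken from the talk): one Unicode string, three different byte representations depending on which encoding is chosen.

```python
text = "\u00e9"  # a single code point: LATIN SMALL LETTER E WITH ACUTE

# The same abstract text becomes different bytes under different encodings.
encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "utf-16-le")}

for name, raw in encoded.items():
    print(name, raw, len(raw))

# Decoding each byte string with its own encoding gives back the same text.
print(all(raw.decode(name) == text for name, raw in encoded.items()))
```

The string itself has no encoding; an encoding only appears at the moment text is turned into bytes or back.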
If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like `isinstance(s, bytes)` or `isinstance(s, unicode)` without wrapping them in either try/except or a Python version test, because `unicode` is undefined in Python 3 and `bytes` is undefined in Python 2 before 2.6 (from 2.6 on it is only an alias for `str`).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type instead of comparing the type itself. An arguably slightly less ugly workaround is to check the Python version number.

Those are both unpythonic, and most of the time there's probably a better way.
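Both workarounds might be sketched as follows. The function names are hypothetical, and the sketch assumes the goal is to end up with the interpreter's native `str` and that the data is plain ASCII.

```python
import sys


def to_native_str_by_name(value):
    """Workaround 1: compare the type's name instead of the type itself."""
    type_name = type(value).__name__
    if type_name == "bytes":    # this name only shows up on Python 3
        return value.decode("ascii")
    if type_name == "unicode":  # this name only shows up on Python 2
        return str(value)
    return value


def to_native_str_by_version(value):
    """Workaround 2: branch on the interpreter version."""
    if sys.version_info >= (3, 0):
        if isinstance(value, bytes):
            return value.decode("ascii")
    elif isinstance(value, unicode):  # only evaluated on Python 2
        return str(value)
    return value


print(to_native_str_by_name(b"abc"))
print(to_native_str_by_version(b"abc"))
```

The name `unicode` inside the second function is never looked up on Python 3, because that branch is skipped at run time.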
Use the six library, which defines the string types separately for each Python version.
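As far as I know, six chooses its string aliases roughly along the following lines. This is a simplified sketch that mirrors the idea without importing six, not a copy of the library's source.

```python
import sys

PY3 = sys.version_info[0] == 3

if PY3:
    string_types = (str,)
    text_type = str
    binary_type = bytes
else:
    # These names exist only on Python 2; this branch never runs on Python 3.
    string_types = (basestring,)
    text_type = unicode
    binary_type = str


def is_text(obj):
    """Same check as isinstance(obj, six.text_type), using the aliases above."""
    return isinstance(obj, text_type)


print(is_text("abc"))
print(is_text(b"abc"))
```

With six installed you would import `six.text_type` and friends instead of redefining them.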
Note that on Python 3, it's not really fair to say any of:
- `str`s are UTFx for any x (e.g. UTF8)
- `str`s are Unicode
- `str`s are ordered collections of Unicode characters

Python's `str` type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, it's not as simple to answer this question as you might imagine.

An obvious way to test for ASCII-compatible strings is by an attempted encode to ASCII. The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points, such as lone surrogates. The same method is used to distinguish them.
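The attempted-encode test could be sketched like this for Python 3. The function name `classify_text` and its three labels are invented for the example.

```python
def classify_text(text):
    """Classify a str by which encodings it survives."""
    try:
        text.encode("ascii")
        return "ascii"
    except UnicodeEncodeError:
        pass
    try:
        text.encode("utf-8")
        return "unicode"
    except UnicodeEncodeError:
        # For example a lone surrogate, which UTF-8 refuses to encode
        return "invalid code points"


print(classify_text("hello"))         # ascii
print(classify_text("Sh\u014dgun"))   # unicode
print(classify_text("hello \udce4"))  # invalid code points
```

Lone surrogates like `\udce4` typically come from decoding with the `surrogateescape` error handler, for instance in file names returned by the operating system.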
This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as utf-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend for it to be Python version agnostic without a version test or importing six. Please comment with improvements to the sample code to help other people.
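A function in that spirit might look like the sketch below. It is a reconstruction under my own assumptions rather than the original code: byte strings are passed through untouched and assumed (not verified) to be UTF-8 already.

```python
def return_utf(value):
    """Return value as UTF-8 encoded bytes, without any version test."""
    if isinstance(value, bytes):
        # Assumption: existing byte strings are already UTF-8; this is not checked.
        return value
    if isinstance(value, (int, float, complex)):
        return str(value).encode("utf-8")
    try:
        return value.encode("utf-8")       # text types have .encode
    except AttributeError:
        return str(value).encode("utf-8")  # anything else: encode its str() form


print(return_utf("Sh\u014dgun"))
print(return_utf(42))
print(return_utf(b"raw"))
```

On Python 2.6 and later `bytes` is an alias for `str`, so the first branch also covers Python 2 byte strings.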
In Python 3, I had to work out whether a string looked like `b='\x7f\x00\x00\x01'` or `b='127.0.0.1'`. My solution worked for me, and I hope it works for anyone who needs it.
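One way to tell the two forms apart is sketched below. It assumes the value arrives as a `bytes` object and uses printability as the test; the name `ip_to_text` is hypothetical.

```python
def ip_to_text(value):
    """Return dotted-quad text for either packed octets or ASCII dotted-quad bytes."""
    try:
        text = value.decode("ascii")
    except UnicodeDecodeError:
        text = None  # bytes above 0x7f: cannot be the textual form
    if text is not None and text.isprintable():
        return text
    # Otherwise treat each byte as one octet of the address
    return ".".join(str(octet) for octet in value)


print(ip_to_text(b"\x7f\x00\x00\x01"))  # 127.0.0.1
print(ip_to_text(b"127.0.0.1"))         # 127.0.0.1
```

This is only a heuristic: four packed octets that all happen to be printable ASCII, such as `b"1234"`, would be mistaken for text.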
You could use Universal Encoding Detector, but be aware that it will only give you its best guess, not the actual encoding, because it's impossible to know the encoding of a string like "abc", for example. You will need to get the encoding information elsewhere; the HTTP protocol, for instance, uses the Content-Type header for that.
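To see why only a guess is possible, here is a small standard-library sketch (it does not use Universal Encoding Detector) that lists every candidate encoding under which some bytes decode without error. The function name and the candidate list are my own choices.

```python
def candidate_encodings(data, encodings=("ascii", "utf-8", "latin-1", "cp1252")):
    """Return the encodings from the list that can decode data without error."""
    matches = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


print(candidate_encodings(b"abc"))   # every candidate fits
print(candidate_encodings(b"\xe9"))  # ['latin-1', 'cp1252']
```

For `b"abc"` all four candidates succeed, so the bytes alone cannot tell you which encoding was intended.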
For py2/py3 compatibility, simply `import six` and test `isinstance(obj, six.text_type)`.
One simple approach is to check whether `unicode` is a builtin. If so, you're in Python 2 and your string will be a plain `str` (a byte string). To ensure everything is in `unicode`, one can convert the value with `unicode()` whenever that builtin exists.
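The builtin check might be sketched like this. The helper name `to_text` is hypothetical, and the import fallback is there because the module is called `builtins` on Python 3 and `__builtin__` on Python 2.

```python
try:
    import builtins                 # Python 3
except ImportError:
    import __builtin__ as builtins  # Python 2


def to_text(value):
    """Convert to unicode on Python 2; leave the value alone on Python 3."""
    unicode_type = getattr(builtins, "unicode", None)
    if unicode_type is not None:
        return unicode_type(value)  # Python 2 only
    return value                    # Python 3: str is already Unicode text


print(to_text("cats"))
```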
A call such as `'Shōgun'.encode('ASCII')` will raise an exception (UnicodeEncodeError) if the string contains any non-ASCII characters. You can use unidecode to convert the Unicode characters to ASCII.
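If unidecode is not available, a rough standard-library stand-in is to decompose the text and drop whatever is not ASCII. This is a different and much cruder technique than unidecode's transliteration tables, and the name `strip_accents` is made up for the example.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented letters, then discard everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


try:
    "Sh\u014dgun".encode("ascii")
except UnicodeEncodeError as error:
    print("not ASCII:", error.reason)

print(strip_accents("Sh\u014dgun"))  # Shogun
```

Characters with no ASCII base letter (Chinese or Cyrillic text, say) are simply dropped by this approach, which is where a real transliteration library earns its keep.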