How do I check if a string is unicode or ascii?

Posted on 2024-10-17 08:54:21

What do I have to do in Python to figure out which encoding a string has?

聆听风音 2024-10-24 08:54:21

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
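
For Python 3, the two separate questions (which Python type, and whether the content happens to be pure ASCII) can be sketched as below. This is only an illustration and not part of the original answer: the function name describe is hypothetical, and the isascii() methods assume Python 3.7 or later.

```python
def describe(value):
    """Report the Python type of value and whether its content is pure ASCII."""
    if isinstance(value, str):
        kind = "text (str)"
    elif isinstance(value, (bytes, bytearray)):
        kind = "bytes"
    else:
        return "not a string"
    # isascii() is available on str, bytes and bytearray from Python 3.7 on
    ascii_note = "ASCII only" if value.isascii() else "contains non-ASCII"
    return "{}, {}".format(kind, ascii_note)


print(describe("abc"))
print(describe("Ü"))
print(describe(b"\xc3\x9c"))
print(describe(42))
```

The type check and the ASCII check are independent: a str can be ASCII-only, and a bytes object can hold anything at all.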

橘亓 2024-10-24 08:54:21

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn't know what
its encoding is. The unicode type is the safer way to store text.
If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode, and is used to
store text. What was called str in Python 2 is called bytes in Python 3.

How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
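
The decode check above can be wrapped in a small helper. This is a minimal sketch for Python 3; the name is_valid_encoding is made up for illustration.

```python
def is_valid_encoding(data, encoding):
    """Return True if the bytes object data decodes cleanly with encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 representation of the letter 'Ü'
print(is_valid_encoding(u_umlaut, "utf-8"))
print(is_valid_encoding(u_umlaut, "ascii"))
print(is_valid_encoding(b"\xff", "utf-8"))
```

A successful decode only shows that the bytes are valid in that encoding, not that they were actually written in it; latin-1, for example, accepts every possible byte sequence.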

送你一个梦 2024-10-24 08:54:21

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which means a Unicode string by default) should suffice.

isinstance(x, str)

With regard to Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check if you have a 'string-like' object all with one statement though, you can do the following:

isinstance(x, basestring)

好菇凉咱不稀罕他 2024-10-24 08:54:21

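Unicode is not an encoding. To quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text"...
...then Unicode is "text-ness";
it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.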

故事↓在人 2024-10-24 08:54:21

If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3, and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is only an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here's an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there's probably a better way.
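
One possible shape for the try/except route mentioned at the start of this answer is sketched below. It is an illustration rather than the answer's own code: text_type and to_text are hypothetical names, and unlike the snippets above it always returns the text type (unicode on Python 2, str on Python 3) instead of the native str.

```python
try:
    text_type = unicode  # Python 2: the name exists
except NameError:
    text_type = str  # Python 3: str is already the text type


def to_text(s, encoding="ascii"):
    """Return s as text, decoding it first if it is a byte string."""
    if isinstance(s, text_type):
        return s
    if isinstance(s, bytes):  # in Python 2.6+, bytes is an alias for str
        return s.decode(encoding)
    raise TypeError("expected text or bytes, got %s" % type(s).__name__)


print(to_text(b"abc"))
print(to_text("abc"))
```

The NameError is raised once at import time, so the rest of the code never has to look at the interpreter version.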

无声情话 2024-10-24 08:54:21

Use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string

Inside the six library, the relevant names are defined as:

if PY3:
    string_types = str,
    text_type = str
else:
    string_types = basestring,
    text_type = unicode

小巷里的女流氓 2024-10-24 08:54:21

Note that on Python 3, it's not really fair to say any of:

strs are UTFx for any x (eg. UTF8)
strs are Unicode
strs are ordered collections of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, it's not as simple to answer this question as you might imagine.

An obvious way to test for ASCII-compatible strings is by an attempted encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method is used to distinguish them.
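
Both checks follow the same pattern, which could be factored into one helper. A minimal sketch for Python 3, with can_encode as a hypothetical name:

```python
def can_encode(text, encoding):
    """Return True if the str text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("Hello there!", "ascii"))
print(can_encode("Hello there... \u2603!", "ascii"))
print(can_encode("\udcc3", "utf-8"))
```

Passing "ascii" tests for ASCII-compatible text, while passing "utf-8" catches strings that contain lone surrogates.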

维持三分热 2024-10-24 08:54:21

This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python-version agnostic, without a version test or importing six. Please comment with improvements to the sample code below to help other people.

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8
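
If only Python 3 has to be supported, the same idea can be written more directly. This is a sketch under that assumption and not a drop-in replacement for the version-agnostic function above; to_utf8_bytes is a hypothetical name.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (Python 3 only)."""
    if isinstance(value, bytes):
        return value  # assumed to be encoded already; not verified
    if isinstance(value, str):
        return value.encode("utf-8")
    # anything else (numbers, other objects) goes through its str() form
    return str(value).encode("utf-8")


print(to_utf8_bytes("Ü"))
print(to_utf8_bytes(3.5))
print(to_utf8_bytes(b"abc"))
```

As in the original, bytes input is passed through untouched, so the caller still has to trust that it was UTF-8 to begin with.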

真心难拥有 2024-10-24 08:54:21

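In Python 3, I had to work out whether a bytes value looked like b'\x7f\x00\x00\x01' or b'127.0.0.1'. My solution was something like this:

def get_str(value):
    # value is a bytes object: either printable ASCII text or raw octets
    try:
        str_value = value.decode('ascii')
    except UnicodeDecodeError:
        str_value = None  # not ASCII text, so treat it as raw octets

    if str_value is not None and str_value.isprintable():
        return str_value

    # format each byte as a decimal number, e.g. b'\x7f\x00\x00\x01' -> '127.0.0.1'
    return '.'.join('%d' % x for x in value)

This worked for me; I hope it works for anyone who needs it.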

多情癖 2024-10-24 08:54:21

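You could use Universal Encoding Detector, but be aware that it will only give you a best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc". You will need to get the encoding information from elsewhere; for example, the HTTP protocol uses the Content-Type header for that.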
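
As an illustration of getting the encoding from elsewhere, the charset parameter of a Content-Type header value can be read with the standard library's email.message module. This is a minimal sketch: charset_from_content_type is a hypothetical helper, and the header strings are made-up examples.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset named in a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the charset and returns the
    # fallback when the header carries no charset parameter
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header names no charset, the caller still has to fall back on a documented default or on a guess.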

尛丟丟 2024-10-24 08:54:21

For py2/py3 compatibility, simply use

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (Unicode) string

深海不蓝 2024-10-24 08:54:21

One simple approach is to check whether unicode is a builtin. If so, you're in Python 2 and a plain string literal will be a byte str. To ensure everything is unicode, one can do:

try:
    import builtins                 # Python 3
except ImportError:
    import __builtin__ as builtins  # Python 2 names the module __builtin__

i = 'cats'
if hasattr(builtins, 'unicode'):    # True in Python 2, False in 3
    i = unicode(i)

静谧 2024-10-24 08:54:21

'Shōgun'.encode('ASCII') will raise an exception if the string contains any non-ASCII characters. You can use unidecode to convert such characters to ASCII.

import sys
import traceback
import types

import unidecode


unicode_string = 'Shōgun'


def unicode_to_ascii(string: str):
    try:
        string.encode(encoding = 'ASCII')
    except UnicodeEncodeError:
        exc_type, exc_value, exc_traceback = sys.exc_info()  # type: type(UnicodeEncodeError), UnicodeEncodeError, types.TracebackType
        traceback.print_exception(exc_type, exc_value, exc_traceback)
        print()
        return unidecode.unidecode_expect_ascii(string)
    else:
        return string


print(unicode_to_ascii(string = unicode_string))

Returns:

Traceback (most recent call last):
  File "C:\Users\phpjunkie\Python\Scripts\debug\unicode.py", line 10, in unicode_to_ascii
    string.encode(encoding = 'ASCII')
UnicodeEncodeError: 'ascii' codec can't encode character '\u014d' in position 2: ordinal not in range(128)

Shogun
