What is the computational complexity of Python 3 `str.__getitem__`?
    # Set up
    with open("Bilion_of_UTF-8_chars.txt", encoding="UTF-8") as f:
        s = f.read()

    # The following doesn't look like a cheap operation,
    # because Python 3 `str`s are UTF-8 encoded (EDIT: in some implementations only).
    my_char = s[453_452_345]
However, many people write loops like this:

    for i in range(len(s)):
        do_something_with(s[i])

which uses the indexing operation n times or more.
How does Python 3 resolve the problem of indexing UTF-8 characters in strings for both code snippets?
- Does it always perform a linear look-up for the nth char (a simple but expensive solution)?
- Or does it store some additional C pointers to perform smart index calculations?
A: O(1)
Python strings are not UTF-8 internally (at least in CPython, the reference implementation): in Python 3, when text is obtained from any external source, it is decoded according to a given codec. This decoding defaults to UTF-8 on most sources/platforms, but varies according to the OS's default. In any case, all relevant "text importing" APIs, like opening a file or connecting to a DB, allow you to specify the text encoding to use.
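As a minimal sketch of that decoding step (the sample text and the temporary file are made up for illustration), the bytes on disk only become a `str` once a codec is applied, and the same bytes read with a different codec yield different text:

```python
import os
import tempfile

text = "na\u00efve caf\u00e9 \u20ac"  # hypothetical sample text with non-ASCII characters

# Write the text as UTF-8 bytes to a throwaway file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Passing the encoding explicitly avoids depending on the platform default.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

    # The same bytes decoded with another codec give a different string.
    with open(path, encoding="latin-1") as f:
        misdecoded = f.read()
finally:
    os.remove(path)

print(decoded == text)  # True
print(len(decoded), len(misdecoded))
```

Once decoding has happened, the resulting `str` no longer remembers which codec produced it.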
Internally, strings use one of "Latin-1", "UCS-2" or "UCS-4", according to the needs of the "widest" code point in the text string.
This is new from Python 3.3 onwards (prior to that, every string used one fixed-width representation chosen when the interpreter was built, either 16-bit or 32-bit UCS-4, even for ASCII-only text). The spec is documented in PEP 393.
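One rough way to observe this on CPython 3.3+ is to compare the sizes of two strings built from the same character. This is only a sketch: `sys.getsizeof` reports implementation details, the helper name `bytes_per_char` is made up here, and other interpreters may report different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per code point by comparing two string sizes."""
    short = ch * n
    long = ch * (2 * n)
    # The fixed object header cancels out, leaving n extra code points.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001F600"),
]
for label, ch in samples:
    print(f"{label:8} U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per char")
```

On CPython the reported widths should be 1, 1, 2 and 4 bytes respectively, matching the three storage kinds described in PEP 393.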
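Therefore, Python can just zero in on the correct character given its index: every code point in a given string occupies the same number of bytes, so the position is a simple offset calculation.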
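The offset calculation can be modelled in pure Python. This is a sketch of the idea only, not CPython's actual C code; the function name `fixed_width_getitem` and the sample string are assumptions made for illustration.

```python
def fixed_width_getitem(data: bytes, width: int, index: int) -> str:
    """Fetch one code point from a fixed-width buffer with a single offset computation."""
    start = index * width
    return chr(int.from_bytes(data[start:start + width], "little"))

s = "h\u00e9llo w\u00f6rld"

# Pick the narrowest width (1, 2 or 4 bytes) that fits the widest code point.
widest = max(map(ord, s))
width = 1 if widest < 0x100 else 2 if widest < 0x10000 else 4

# Store every code point using that same width.
data = b"".join(ord(ch).to_bytes(width, "little") for ch in s)

print(fixed_width_getitem(data, width, 7) == s[7])  # True
```

The work done by `fixed_width_getitem` does not depend on how large `index` is, which is what makes the lookup O(1).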
As an anecdote, Luciano Ramalho (author of the book Fluent Python) wrote `LeanStr`, a learning-purpose implementation of a string class that holds UTF-8 internally. Of course, then your worries about `__getitem__` complexity apply: https://github.com/ramalho/leanstr

Unfortunately (or fortunately, in this case), a lot of the standard library and native code extensions to Python will not accept a class similar to `str`, even if it inherits from `str` and keeps its data separately, re-implementing all dunder methods. But if all `str` methods are in place, any pure-Python code dealing with strings should accept a `LeanStr` instance.

Other implementations: PyPy
So, it happens that how text is stored internally is an "implementation detail", and PyPy from version 7.1 onwards does use UTF-8 byte strings internally for its text objects.
Unlike Ramalho's naive `LeanStr` above, however, they do keep an index for each 4th UTF-8 char, so that char access by index can still be made in O(1). I did not find any docs about it, but the code for creating the index can be found in the PyPy source.
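To make the difference concrete, here is a toy UTF-8 backed class offering both a naive linear scan and a lookup through a sampled index. It is only a sketch of the general idea: the class `IndexedUtf8` and its methods are hypothetical, it supports only non-negative indices, and it reflects neither the actual `LeanStr` code nor PyPy's real data structure.

```python
class IndexedUtf8:
    """Toy UTF-8 backed text with a sampled index (illustration only)."""

    STEP = 4  # record the byte offset of every 4th code point

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        # A code point starts at every byte that is not a 10xxxxxx continuation byte.
        starts = [i for i, b in enumerate(self.data) if b & 0xC0 != 0x80]
        self.length = len(starts)
        self.checkpoints = starts[::self.STEP]

    def _advance(self, pos: int, count: int) -> int:
        """Skip `count` code points starting from byte offset `pos`."""
        for _ in range(count):
            pos += 1
            while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
                pos += 1
        return pos

    def getitem_naive(self, index: int) -> str:
        """Linear scan from the start of the buffer: cost grows with the index."""
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        start = self._advance(0, index)
        return self.data[start:self._advance(start, 1)].decode("utf-8")

    def __getitem__(self, index: int) -> str:
        """Jump to the nearest checkpoint, then scan at most STEP - 1 code points."""
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        base = self.checkpoints[index // self.STEP]
        start = self._advance(base, index % self.STEP)
        return self.data[start:self._advance(start, 1)].decode("utf-8")


text = "a\u00f1\u20ac\U0001F600 d\u00e9j\u00e0 vu"
t = IndexedUtf8(text)
print(t[3] == text[3], t.getitem_naive(3) == text[3])  # True True
```

The checkpoint list costs one extra integer per `STEP` code points, and in exchange each lookup scans a bounded number of bytes regardless of the index.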
I mentioned this question on Twitter, as I am an acquaintance of Ramalho, and eventually Carl Friedrich Bolz-Tereick, one of the PyPy developers, replied in a tweet.