What is the computational complexity of Python 3 `str.__getitem__`?

Posted on 2025-02-02 02:59:05 | Length: 519 | Views: 1 | Comments: 0

# Set up
with open("Bilion_of_UTF-8_chars.txt", encoding="UTF-8") as f:
    s = f.read()

# The following doesn't look like a cheap operation, because
# Python 3 `str` objects are UTF-8 encoded (EDIT: in some implementations only).
my_char = s[453_452_345]

However, many people write loops like this:

for i in range(len(s)):
    do_something_with(s[i])

which performs the indexing operation n times or more.

How does Python 3 handle indexing into UTF-8 strings in both code snippets?

Does it always perform a linear look-up for the nth character (a simple but expensive approach)?
Or does it store some additional C pointers to perform smart index calculations?

我偏爱纯白色 2025-02-09 02:59:05

What is the computational complexity of Python 3 str.__getitem__?

A: O(1)

Python strings are not UTF-8 internally. In Python 3, when text is read from any external source, it is decoded according to a given codec. That decoding defaults to UTF-8 on most sources and platforms, though the default varies with the operating system. In any case, all relevant "text importing" APIs, such as opening a file or connecting to a database, let you specify the text encoding to use.
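To make the difference between encoded bytes and decoded text concrete, here is a small sketch. The sample string and the variable names are made up for illustration; only `str.encode`, `bytes.decode`, and `len` are used.

```python
# Illustrative sample text (an assumption, not taken from the question):
# "naive" with a diaeresis, a space, a euro sign, and a digit.
text = "na\u00efve \u20ac5"

raw = text.encode("utf-8")       # the bytes a UTF-8 file would contain
decoded = raw.decode("utf-8")    # what reading that file as UTF-8 gives back

print(len(raw))                  # number of bytes
print(len(decoded))              # number of code points
print(decoded[6])                # indexing addresses code points, not bytes

# Decoding the same bytes with a different codec yields different text.
wrong = raw.decode("latin-1")
print(len(wrong))
```

The byte count and the code point count differ because some characters need more than one byte in UTF-8, while `str` indexing always counts code points.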

Internally, each string is stored as one of Latin-1, UCS-2, or UCS-4, chosen according to the widest code point in that string. All three are fixed-width encodings, using 1, 2, or 4 bytes per code point respectively.

This is new from Python 3.3 onwards (prior to that, the internal representation was fixed when the interpreter was built, as either 16-bit UCS-2 or 32-bit UCS-4, even for ASCII-only text). The spec is documented in PEP 393.

Therefore, since every code point in a given string occupies the same number of bytes, Python can zero in on the correct character directly from its index.
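One way to observe the fixed-width storage is to compare the sizes of two strings made of the same repeated character. This is a rough sketch that assumes CPython 3.3 or later; `bytes_per_char` is a hypothetical helper written for this illustration, and other implementations may report different numbers.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate storage per character by differencing two string sizes.

    Subtracting the size of an n-character string from that of a
    2n-character string cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

With a fixed width per character, the address of character `i` is just the start of the buffer plus `i` times the width, which is why indexing does not need to scan the string.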

As an anecdote, Luciano Ramalho (author of the book Fluent Python) wrote leanstr, a learning-purpose implementation of a string class that holds UTF-8 internally. Of course, your worries about __getitem__ complexity then apply: https://github.com/ramalho/leanstr

Unfortunately (or fortunately, in this case), much of the standard library and many native code extensions to Python will not accept a class that merely resembles str, even if it inherits from str, keeps its data separately, and re-implements all dunder methods. But if all str methods are in place, any pure-Python code dealing with strings should accept a LeanStr instance.

Other implementations: PyPy

How text is stored internally is an "implementation detail", and PyPy, from version 7.1 onwards, does use UTF-8 byte strings internally for its text objects.

Unlike Ramalho's naive leanstr above, however, they keep an index entry for every 4th UTF-8 character, so that character access by index can still be done in O(1). I did not find any docs about it, but the code for creating the index is here.
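The idea of a sparse index over UTF-8 data can be sketched as follows. This is not PyPy's code: `IndexedUtf8` and everything in it are hypothetical names written for this illustration, assuming only that an offset is recorded for every 4th character as described above.

```python
class IndexedUtf8:
    """Toy UTF-8 backed string with a sparse char-index to byte-offset table."""

    STEP = 4  # record a byte offset for every 4th character

    def __init__(self, text: str) -> None:
        self.data = text.encode("utf-8")
        self.length = len(text)
        # offsets[k] is the byte offset of character number k * STEP.
        self.offsets = []
        char_no = 0
        for pos, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a char starts here
                if char_no % self.STEP == 0:
                    self.offsets.append(pos)
                char_no += 1

    @staticmethod
    def _width(lead: int) -> int:
        """Byte length of a UTF-8 sequence, given its leading byte."""
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> str:
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        # Jump to the nearest recorded offset, then walk at most STEP - 1 chars.
        pos = self.offsets[index // self.STEP]
        for _ in range(index % self.STEP):
            pos += self._width(self.data[pos])
        end = pos + self._width(self.data[pos])
        return self.data[pos:end].decode("utf-8")


sample = IndexedUtf8("a\u00e9\u20ac\U0001F600xyz")
print(sample[3], len(sample))
```

Because the walk after the table lookup is bounded by a constant (3 characters here), each lookup takes constant time, at the cost of one stored offset per 4 characters.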

I mentioned this question on Twitter, as I am an acquaintance of Ramalho, and eventually Carl Friedrich Bolz-Tereick, one of the PyPy developers, replied:

It's worked really quite well for us! Most Unicode strings don't need this index, and zero copy utf-8 decoding is quite cool. What's most annoying is actually str.find, because there you need the reverse conversion, from byte index to char index. we don't have an index for that.
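The reverse conversion mentioned in the quote can be illustrated with a short sketch. `char_index_of` is a hypothetical helper, not PyPy's implementation: it searches the UTF-8 bytes directly and then has to count characters before the match, which is a linear scan when no index exists for that direction.

```python
def char_index_of(haystack: bytes, needle: bytes) -> int:
    """Return the character index of needle in UTF-8 encoded haystack, or -1."""
    byte_pos = haystack.find(needle)  # fast search, but yields a byte offset
    if byte_pos < 0:
        return -1
    # Convert the byte offset to a char index by counting the bytes that
    # start a character (every byte that is not a continuation byte).
    return sum(1 for b in haystack[:byte_pos] if b & 0xC0 != 0x80)


text = "h\u00e9llo w\u00f6rld \u20acuro"
print(char_index_of(text.encode("utf-8"), "\u20ac".encode("utf-8")))
print(text.find("\u20ac"))
```

Both calls print the same character index; the difference is that the UTF-8 version needs the extra counting pass over the bytes before the match.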
