Python sum: why not strings?

Posted on 2024-09-15 08:08:16

Python has a built-in function sum, which is effectively equivalent to:

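import operator
from functools import reduce  # reduce is not a builtin on Python 3

def sum2(iterable, start=0):
    # passing start as the initializer also handles an empty iterable
    return reduce(operator.add, iterable, start)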

for all types of parameters except strings. It works for numbers and lists, for example:

 sum([1, 2, 3], 0) == sum2([1, 2, 3], 0) == 6    # Note: 0 is the default value for start, but I include it for clarity
 sum({888: 1}, 0) == sum2({888: 1}, 0) == 888    # iterating over a dict yields its keys

Why were strings specially left out?

 sum(['foo', 'bar'], '')   # TypeError: sum() can't sum strings [use ''.join(seq) instead]
 sum2(['foo', 'bar'], '') == 'foobar'

I seem to remember discussions on the Python mailing list about the reason, so an explanation or a link to a thread explaining it would be fine.

Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.

Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?

Comments (8)

A君 2024-09-22 08:08:16

Python tries to discourage you from "summing" strings. You're supposed to join them:

"".join(list_of_strings)

It's a lot faster, and uses much less memory.

A quick benchmark:

$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
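
The commands above rely on reduce being a builtin, which is only true on Python 2. A rough Python 3 sketch of the same comparison, using the timeit module from a script, could look like the following. The helper names concat_with_reduce and concat_with_join are made up for illustration, and the timings will differ from the figures quoted above.

```python
import operator
import timeit
from functools import reduce  # reduce is no longer a builtin on Python 3

strings = ["a"] * 10000


def concat_with_reduce(items):
    """Concatenate by repeated addition, as in the first command above."""
    return reduce(operator.add, items)


def concat_with_join(items):
    """Concatenate with str.join, as in the second command above."""
    return "".join(items)


if __name__ == "__main__":
    runs = 100
    for func in (concat_with_reduce, concat_with_join):
        seconds = timeit.timeit(lambda: func(strings), number=runs)
        print(f"{func.__name__}: {seconds / runs * 1e6:.1f} usec per call")
```

Both helpers return the same string; only the time taken to build it differs.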

Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.

BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).

少年亿悲伤 2024-09-22 08:08:16

You can in fact use sum(..) to concatenate strings if you use an appropriate starting object! Of course, if you go this far, you already understand enough to use "".join(..) anyway.

>>> class ZeroObject(object):
...  def __add__(self, other):
...   return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'

过气美图社 2024-09-22 08:08:16

Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup

In the builtin_sum function we have this bit of code:

     /* reject string values for 'start' parameter */
        if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
            PyErr_SetString(PyExc_TypeError,
                "sum() can't sum strings [use ''.join(seq) instead]");
            Py_DECREF(iter);
            return NULL;
        }
        Py_INCREF(result);
    }

So... that's your answer.

It's explicitly checked in the code and rejected.
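
Seen from the Python side, the effect of that check can be observed with a small experiment. This is only a sketch: try_sum is a hypothetical helper written for this illustration, not part of any library.

```python
def try_sum(items, start):
    """Return sum(items, start), or the TypeError message if sum refuses."""
    try:
        return sum(items, start)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_sum([1, 2, 3], 0))          # numbers are accepted
print(try_sum([[1], [2, 3]], []))     # lists are accepted too
print(try_sum(["foo", "bar"], ""))    # a string start value is rejected
```

Note that the C code tests the type of the start value, not the type of the items in the iterable.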

夜还是长夜 2024-09-22 08:08:16

From the docs:

The preferred, fast way to concatenate a
sequence of strings is by calling
''.join(sequence).

By making sum refuse to operate on strings, Python has encouraged you to use the correct method.

挽手叙旧 2024-09-22 08:08:16

Short answer: Efficiency.

Long answer: The sum function has to create an object for each partial sum.

Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.

doubles are always the same size, which makes sum's running time O(1)×N = O(N).

int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).

For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).

Thus, summing strings would be much slower than summing numbers.
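
The arithmetic for str can be checked with a small cost model. This is a sketch under the same assumption as above (creating an object costs time proportional to its size); it counts characters written rather than measuring real time, and the function names are made up for illustration.

```python
def chars_copied_by_repeated_add(strings):
    """Count characters written if every partial sum is built as a new string."""
    copied = 0
    partial_len = 0
    for s in strings:
        partial_len += len(s)   # length of the new partial sum
        copied += partial_len   # a fresh object of that length is written
    return copied


def chars_copied_by_single_pass(strings):
    """Count characters written if each input is copied exactly once."""
    return sum(len(s) for s in strings)


items = ["abc"] * 4  # N = 4 strings, each of length M = 3
print(chars_copied_by_repeated_add(items))  # 3 + 6 + 9 + 12 = 30
print(chars_copied_by_single_pass(items))   # 4 * 3 = 12
```

For N strings of length M the first count is M×N×(N+1)/2, which matches the O(N²) bound, while the second is M×N.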

str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
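
The preallocate-then-copy idea can be illustrated in pure Python on bytes objects. This is a sketch of the strategy described above, not CPython's actual implementation, and join_bytes is a hypothetical name.

```python
def join_bytes(chunks):
    """Concatenate bytes objects with one allocation and one copy per chunk."""
    chunks = list(chunks)                  # the input is walked twice
    total = sum(len(c) for c in chunks)    # first pass: compute the final size
    buf = bytearray(total)                 # allocate the whole buffer once
    pos = 0
    for c in chunks:                       # second pass: copy each chunk into place
        buf[pos:pos + len(c)] = c
        pos += len(c)
    return bytes(buf)


print(join_bytes([b"foo", b"bar"]))  # b'foobar'
```

In practice b"".join(chunks) or "".join(strings) already does this job; the sketch only makes the two passes visible.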

沉睡月亮 2024-09-22 08:08:16

The Reason Why

@dan04 has an excellent explanation for the costs of using sum on large lists of strings.

The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
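
For comparison, sum is still accepted for lists even though it has the same quadratic behaviour. The sketch below shows that, next to itertools.chain.from_iterable, which is one common single-pass alternative (a suggestion added here, not something the answer above prescribes).

```python
from itertools import chain

lists = [[1, 2], [3], [4, 5]]

# Allowed by sum, but a new list is built at every step.
flat_with_sum = sum(lists, [])

# Walks every element once and builds the result list a single time.
flat_with_chain = list(chain.from_iterable(lists))

print(flat_with_sum == flat_with_chain)
```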

背叛残局 2024-09-22 08:08:16

Edit: Moved the parts about immutability to history.

Basically, it's a question of preallocation. When you use a statement such as

sum(["a", "b", "c", ..., ])

and expect it to work similar to a reduce statement, the code generated looks something like

v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len(v1) + len("b")
...
res = v9999 + "$" # must allocate res and set its size to len(v9999) + len("$")

In each of these steps a new string is created, which for one thing adds copying overhead as the strings get longer and longer. But that's maybe not the point here. What's more important is that every new string on each line must be allocated at its specific size. I don't know whether it must allocate in every iteration of the reduce statement; there might be some obvious heuristics to use, and Python might allocate a bit more here and there for reuse. But at several points the new string will be large enough that this won't help anymore and Python must allocate again, which is rather expensive.

A dedicated method like join, however, has the job of figuring out the real size of the string before it starts. In theory it would therefore allocate only once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.

诠释孤独 2024-09-22 08:08:16

I don't know why, but this works!

import operator
from functools import reduce  # needed on Python 3, where reduce is not a builtin

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)