A question about the uniqueness of string instances in Python

Posted on 2024-07-27 20:31:27

I was trying to figure out which integers Python only instantiates once (-5 to 256 in CPython, it seems), and in the process stumbled on some string behaviour I can't see the pattern in. Sometimes equal strings created in different ways share the same id, sometimes not. This code:

A = "10000"
B = "10000"
C = "100" + "00"
D = "%i"%10000
E = str(10000)
F = str(10000)
G = str(100) + "00"
H = "0".join(("10","00"))

for obj in (A,B,C,D,E,F,G,H):
    print obj, id(obj), obj is A

prints:

10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959456 False
10000 4959488 False
10000 4959520 False
10000 4959680 False

I don't even see the pattern, save for the fact that the first four don't have an explicit function call. But surely that can't be it, since the "+" in C, for example, implies a function call to __add__. I especially don't understand why C and G are different, seeing as that implies that the ids of the components of the addition are more important than the outcome.

So, what is the special treatment that A-D undergo, making them come out as the same instance?

时间你老了 2024-08-03 20:31:28

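Python is allowed to inline string constants; A, B, C and D are actually the same literal (if Python sees a constant expression, it treats it as a constant).

str is actually a class, so str(whatever) calls that class's constructor, which should yield a fresh object. This explains E, F and G (note that each of them has a separate identity).

As for H, I am not sure, but my guess is that this expression is too complicated for Python to figure out that it is actually a constant, so it computes a new string.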
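One way to check the constant-expression claim on a current interpreter is to look at the compiled bytecode. The sketch below is written for Python 3 (the question's code is Python 2) and assumes CPython; the function names are made up for illustration, and other implementations or versions may compile these expressions differently.

```python
import dis

def folded():
    # Both operands are literals, so CPython can evaluate this at compile time.
    return "100" + "00"

def not_folded():
    # str is looked up at run time (it could be rebound), so this stays a call.
    return str(100) + "00"

def opnames(func):
    """Return the bytecode instruction names of func."""
    return [ins.opname for ins in dis.get_instructions(func)]

# On CPython the folded result is stored among the function's constants.
print("10000" in folded.__code__.co_consts)
# Is a BINARY_* instruction left in each function?
print(any(name.startswith("BINARY") for name in opnames(folded)))
print(any(name.startswith("BINARY") for name in opnames(not_folded)))
# Does not_folded() still contain a call instruction?
print(any(name.startswith("CALL") for name in opnames(not_folded)))
```

If the first function shows no addition instruction while the second keeps both a call and an addition, that matches the explanation given for C versus G.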

薄情伤 2024-08-03 20:31:28

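I believe short strings that can be evaluated at compile time will be interned automatically. In the last examples, the result can't be evaluated at compile time because str or join might be redefined.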
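Automatic interning is an implementation decision, but interning can also be requested explicitly. A minimal Python 3 sketch using sys.intern from the standard library (the variable names are illustrative only):

```python
import sys

# Two equal strings built at run time; nothing guarantees they share an object.
left = "".join(["10", "000"])
right = str(10000)

# sys.intern returns the one canonical object for a given string value.
canonical_left = sys.intern(left)
canonical_right = sys.intern(right)

print(left == right)                      # equal values
print(canonical_left is canonical_right)  # same object after interning
```

Only the interned references are documented to be identical; whether left and right themselves are the same object is still up to the interpreter.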

西瑶 2024-08-03 20:31:28

In answer to S.Lott's suggestion of examining the bytecode:

import dis
def moo():
    A = "10000"
    B = "10000"
    C = "100" + "00"
    D = "%i"%10000
    E = str(10000)
    F = str(10000)
    G = "1000"+str(0)
    H = "0".join(("10","00"))
    I = str("10000")

    for obj in (A,B,C,D,E,F,G,H, I):
        print obj, id(obj), obj is A
moo()
dis.dis(moo)  # dis.dis() prints the disassembly itself and returns None

yields:

10000 4968128 True
10000 4968128 True
10000 4968128 True
10000 4968128 True
10000 2840928 False
10000 2840896 False
10000 2840864 False
10000 2840832 False
10000 4968128 True
  4           0 LOAD_CONST               1 ('10000')
              3 STORE_FAST               0 (A)

  5           6 LOAD_CONST               1 ('10000')
              9 STORE_FAST               1 (B)

  6          12 LOAD_CONST              10 ('10000')
             15 STORE_FAST               2 (C)

  7          18 LOAD_CONST              11 ('10000')
             21 STORE_FAST               3 (D)

  8          24 LOAD_GLOBAL              0 (str)
             27 LOAD_CONST               5 (10000)
             30 CALL_FUNCTION            1
             33 STORE_FAST               4 (E)

  9          36 LOAD_GLOBAL              0 (str)
             39 LOAD_CONST               5 (10000)
             42 CALL_FUNCTION            1
             45 STORE_FAST               5 (F)

 10          48 LOAD_CONST               6 ('1000')
             51 LOAD_GLOBAL              0 (str)
             54 LOAD_CONST               7 (0)
             57 CALL_FUNCTION            1
             60 BINARY_ADD          
             61 STORE_FAST               6 (G)

 11          64 LOAD_CONST               8 ('0')
             67 LOAD_ATTR                1 (join)
             70 LOAD_CONST              12 (('10', '00'))
             73 CALL_FUNCTION            1
             76 STORE_FAST               7 (H)

 12          79 LOAD_GLOBAL              0 (str)
             82 LOAD_CONST               1 ('10000')
             85 CALL_FUNCTION            1
             88 STORE_FAST               8 (I)

 14          91 SETUP_LOOP              66 (to 160)
             94 LOAD_FAST                0 (A)
             97 LOAD_FAST                1 (B)
            100 LOAD_FAST                2 (C)
            103 LOAD_FAST                3 (D)
            106 LOAD_FAST                4 (E)
            109 LOAD_FAST                5 (F)
            112 LOAD_FAST                6 (G)
            115 LOAD_FAST                7 (H)
            118 LOAD_FAST                8 (I)
            121 BUILD_TUPLE              9
            124 GET_ITER            
        >>  125 FOR_ITER                31 (to 159)
            128 STORE_FAST               9 (obj)

 15         131 LOAD_FAST                9 (obj)
            134 PRINT_ITEM          
            135 LOAD_GLOBAL              2 (id)
            138 LOAD_FAST                9 (obj)
            141 CALL_FUNCTION            1
            144 PRINT_ITEM          
            145 LOAD_FAST                9 (obj)
            148 LOAD_FAST                0 (A)
            151 COMPARE_OP               8 (is)
            154 PRINT_ITEM          
            155 PRINT_NEWLINE       
            156 JUMP_ABSOLUTE          125
        >>  159 POP_BLOCK           
        >>  160 LOAD_CONST               0 (None)
            163 RETURN_VALUE

So it would seem that the compiler does indeed treat A-D as the same constant, and so it saves memory by generating it only once (as suggested by Alex, Maciej and Greg). (The added case I seems to just be str() realising it's being asked to make a string from a string, and just passing it through.)
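The pass-through in case I can be checked directly. This is a small Python 3 sketch that assumes CPython's behaviour; the Label subclass is hypothetical and only there for contrast.

```python
class Label(str):
    """Hypothetical str subclass, used only for comparison."""

plain = "10000"
label = Label("10000")

# CPython returns the argument itself when it is already an exact str.
print(str(plain) is plain)

# A subclass instance is converted to a plain str, so the result is a different object.
converted = str(label)
print(type(converted) is str, converted is label)
```

The identity result for the plain string is an implementation detail, not something the language promises.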

Thanks everyone, that's a lot clearer now.

鼻尖触碰 2024-08-03 20:31:27

In terms of language specification, any compliant Python compiler and runtime is fully allowed, for any instance of an immutable type, to make a new instance OR find an existing instance of the same type that's equal to the required value and use a new reference to that same instance. This means it's always incorrect to use is (or comparison by id) as an equality test among immutables, and any minor release may tweak or change strategy in this matter to enhance optimization.
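In practice this means comparing values with == and treating the result of is as unspecified. A short Python 3 illustration (build_at_runtime is a made-up helper):

```python
def build_at_runtime():
    """Build the text '10000' without using a '10000' literal."""
    return str(100 * 100)

first = "10000"
second = build_at_runtime()

# Equality is defined by the language: equal text compares equal.
print(first == second)

# Identity is left to the implementation; this may print True or False.
print(first is second)
```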

In terms of implementations, the tradeoffs are pretty clear: trying to reuse an existing instance may mean time spent (perhaps wasted) trying to find such an instance, but if the attempt succeeds then some memory is saved (as well as the time to allocate and later free the memory needed to hold a new instance).

How to solve those implementation tradeoffs is not entirely obvious -- if you can identify heuristics that indicate that finding a suitable existing instance is likely and the search (even if it fails) will be fast, then you may want to attempt the search-and-reuse when the heuristics suggest it, but skip it otherwise.

In your observations you seem to have found a particular dot-release implementation that performs a modicum of peephole optimization when that's entirely safe, fast, and simple, so the assignments A to D all boil down to exactly the same as A (but E to H don't, as they involve named functions or methods that the optimizer's authors may reasonably have considered not 100% safe to assume semantics for -- and low-ROI if that was done -- so they're not peephole-optimized).

Thus, A to D reusing the same instance boils down to A and B doing so (as C and D get peephole-optimized to exactly the same construct).

That reuse, in turn, clearly suggests compiler tactics/optimizer heuristics whereby identical literal constants of an immutable type in the same function's local namespace are collapsed to references to just one instance in the function's .func_code.co_consts (to use current CPython's terminology for attributes of functions and code objects; in Python 3 the attribute is .__code__.co_consts) -- reasonable tactics and heuristics, as reuse of the same immutable constant literal within one function is somewhat frequent, AND the price is only paid once (at compile time) while the advantage is accrued many times (every time the function runs, maybe within loops etc etc).
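The collapsing of identical literals can be observed on the code object. A minimal Python 3 sketch, assuming CPython (the function name twice is illustrative):

```python
def twice():
    a = "10000"
    b = "10000"
    # Both names refer to whatever constant the compiler stored.
    return a is b

consts = twice.__code__.co_consts
print(consts.count("10000"))  # how many times the literal was stored
print(twice())                # whether a and b are the same object
```

On CPython the literal is stored once, so both assignments load the same constant; another implementation would be free to do otherwise.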

(It so happens that these specific tactics and heuristics, given their clearly-positive tradeoffs, have been pervasive in all recent versions of CPython, and, I believe, IronPython, Jython, and PyPy as well;-).

This is a somewhat worthy and interesting area of study if you're planning to write compilers, runtime environments, peephole optimizers, etc etc, for Python itself or similar languages. I guess that deep study of the internals (ideally of many different correct implementations, of course, so as not to fixate on the quirks of a specific one -- good thing Python currently enjoys at least 4 separate production-worthy implementations, not to mention several versions of each!) can also help, indirectly, make one a better Python programmer -- but it's particularly important to focus on what's guaranteed by the language itself, which is somewhat less than what you'll find in common among separate implementations, because the parts that "just happen" to be in common right now (without being required to be so by the language specs) may perfectly well change under you at the next point release of one or another implementation and, if your production code was mistakenly relying on such details, that might cause nasty surprises;-). Plus -- it's hardly ever necessary, or even particularly helpful, to rely on such variable implementation details rather than on language-mandated behavior (unless you're coding something like an optimizer, debugger, profiler, or the like, of course;-).
