I've recently been reading about immutable strings, in "Why can't strings be mutable in Java and .NET?" and "Why .NET String is immutable?", as well as some material on why D chose immutable strings. There seem to be many advantages:
- trivially thread safe
- more secure
- more memory efficient in most use cases
- cheap substrings (tokenizing and slicing)
Not to mention that most new languages have immutable strings: D 2.0, Java, C#, Python, etc.
Would C++ benefit from immutable strings?
Is it possible to implement an immutable string class in C++ (or C++0x) that would have all of these advantages?
Update:
There are two attempts at immutable strings, const_string and fix_str. Neither has been updated in half a decade. Are they even used? Why didn't const_string ever make it into Boost?
I found that most people in this thread do not really understand what an immutable_string is. It is not only about constness. The real power of an immutable_string is its performance (even in a single-threaded program) and its memory usage.
Imagine that all strings are immutable and that every string is implemented as just two data members, _head (a pointer to the first character) and _len (the length).
How can we implement a substring operation? We don't need to copy any characters. All we have to do is assign _head and _len. The substring then shares the same memory segment with the source string.
Of course, we cannot really implement an immutable_string with only those two data members. A real implementation might need a reference-counted (or flyweighted) memory block as well.
Both memory usage and performance would be better than with a traditional string in most cases, especially when you know what you are doing.
Of course C++ can benefit from an immutable string, and it would be nice to have one. I have checked the boost::const_string and the fix_str mentioned by Cubbi. Those should be what I am talking about.
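A rough sketch of the layout described above may make the idea more concrete. It keeps the _head and _len members from the answer and adds a reference-counted block; the class name, the _mem member, and the helper functions are assumptions made for illustration, not the API of const_string or fix_str.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical sketch: an immutable string whose substrings share one buffer.
class immutable_string {
public:
    explicit immutable_string(std::string_view text)
        : _mem(std::make_shared<const std::string>(text)),
          _head(_mem->data()),
          _len(_mem->size()) {}

    // O(1): no characters are copied, the result points into the same buffer.
    immutable_string substr(std::size_t pos, std::size_t count) const {
        assert(pos <= _len);
        if (count > _len - pos) count = _len - pos;  // clamp to the end
        return immutable_string(_mem, _head + pos, count);
    }

    std::string_view view() const { return {_head, _len}; }
    std::size_t size() const { return _len; }
    long owners() const { return _mem.use_count(); }  // how many strings share the buffer

private:
    immutable_string(std::shared_ptr<const std::string> mem,
                     const char* head, std::size_t len)
        : _mem(std::move(mem)), _head(head), _len(len) {}

    std::shared_ptr<const std::string> _mem;  // reference-counted memory block
    const char* _head;                        // first character of this string
    std::size_t _len;                         // number of characters
};
```

Taking a substring of "hello world" this way only bumps a reference count and stores a new pointer and length. The trade-off is that a small slice keeps the whole original buffer alive, and the slice is not null-terminated.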
As an opinion:
Is it really worth doing (as a standard library feature)? I would say not. Using const gives you locally immutable strings, and the basic nature of systems programming languages means that you really do need mutable strings.
My conclusion is that C++ does not require the immutable pattern because it has const semantics.
In Java, if you have a Person class and you return the String name of the person with the getName() method, your only protection is the immutable pattern. If it were not there, you would have to clone() your strings all night and day (as you have to do with data members that are not typical value objects but still need to be protected).
In C++ you have const std::string& getName() const. So you can write SomeFunction(person.getName()), where the function is declared like void SomeFunction(const std::string& subject).
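As a minimal sketch of that point, the Person class below exposes its name only through a reference to const. The constructor, the name_ member, and the body of SomeFunction are assumptions added so the example compiles; only the two signatures come from the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

class Person {
public:
    explicit Person(std::string name) : name_(std::move(name)) {}

    // Read-only access: no copy is made, and callers cannot modify name_
    // through the returned reference.
    const std::string& getName() const { return name_; }

private:
    std::string name_;
};

// The getter hands out a reference to const, never a mutable one.
static_assert(std::is_same_v<decltype(std::declval<const Person&>().getName()),
                             const std::string&>);

// Takes the string by reference to const, so no copy happens here either.
// The body is a placeholder for illustration.
std::size_t SomeFunction(const std::string& subject) {
    return subject.size();
}
```

Note that this protects the string only through this particular access path: the Person object itself can still change its name_, which is the difference between read-only and immutable that other answers discuss.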
You're certainly not the only person who thought that. In fact, there is a const_string library by Maxim Yegorushkin, which seems to have been written with inclusion into Boost in mind. And here's a slightly newer library, fix_str by Roland Pibinger. I'm not sure how tricky full string interning at run time would be, but most of the advantages are achievable when necessary.
I don't think there's a definitive answer here. It's subjective, if not because of personal taste then at least because of the type of code one most often deals with. (Still, a valuable question.)
Immutable strings are great when memory is cheap. This wasn't true when C++ was developed, and it isn't the case on all platforms targeted by C++. (OTOH, on more limited platforms C seems much more common than C++, so that argument is weak.)
You can create an immutable string class in C++, and you can make it largely compatible with std::string, but you will still lose when comparing it to a built-in string class with dedicated optimizations and language features.
std::string is the best standard string we get, so I wouldn't like to see any messing with it. I use it very rarely, though; std::string has too many drawbacks from my point of view.
Declare the string const and there you go. A string literal is also immutable, unless you want to get into undefined behavior.
Edit: Of course that's only half the story. A const string variable isn't very useful because you can't make it refer to a new string. A reference to a const string would do it, except that C++ won't let you reseat a reference the way other languages such as Python do. The closest thing would be a smart pointer to a dynamically allocated string.
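The smart-pointer idea from the edit can be sketched roughly as follows. The alias StringRef and the helper make_string are hypothetical names chosen for this example; std::shared_ptr and std::make_shared are the standard facilities.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical alias: a handle that can be pointed at a different string,
// while the string it refers to cannot be modified through it.
using StringRef = std::shared_ptr<const std::string>;

// Hypothetical helper: moves the text into a dynamically allocated const string.
StringRef make_string(std::string text) {
    return std::make_shared<const std::string>(std::move(text));
}
```

Assigning a new value to such a handle reseats it without touching the old string, so other handles that still refer to the old string keep seeing the old contents, much like variable assignment in Python or Java.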
Immutable strings are great if, whenever it's necessary to create a new string, the memory manager will always be able to determine the whereabouts of every string reference. On most platforms, language support for such an ability could be provided at relatively modest cost, but on platforms without such language support built in it's much harder.
If, for example, one wanted to design a Pascal implementation on x86 that supported immutable strings, it would be necessary for the string allocator to be able to walk the stack to find all string references; the only execution-time cost of that would be requiring a consistent function-call approach [e.g. not using tail calls, and having every non-leaf function maintain a frame pointer]. Each memory area allocated with new would need to have a bit to indicate whether it contained any strings, and those that do contain strings would need to have an index to a memory-layout descriptor, but those costs would be pretty slight.
If a GC weren't able to walk the stack, then it would be necessary to have code use handles rather than pointers, and have code create string handles when local variables come into scope and destroy the handles when they go out of scope. That means much greater overhead.
Qt also uses shared strings with copy-on-write (QString is implicitly shared), which behave like immutable strings until a copy is modified.
There is some debate about how much performance that really buys you with decent compilers.
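For readers unfamiliar with the technique, here is a minimal copy-on-write sketch. It is not Qt's implementation; the class name cow_string and all of its members are assumptions for illustration, and the use_count() check is only meaningful in single-threaded code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write wrapper (single-threaded sketch).
class cow_string {
public:
    explicit cow_string(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))) {}

    // Reading never copies; copies of a cow_string share one buffer.
    const std::string& read() const { return *data_; }

    // Writing first detaches from any other owners, then modifies.
    void append(const std::string& suffix) {
        detach();
        data_->append(suffix);
    }

    bool shares_with(const cow_string& other) const {
        return data_ == other.data_;
    }

private:
    // Make a private copy of the buffer if someone else is still using it.
    void detach() {
        if (data_.use_count() > 1)
            data_ = std::make_shared<std::string>(*data_);
    }

    std::shared_ptr<std::string> data_;
};
```

Copying such an object is cheap, and the full copy is paid only on the first write after sharing. A thread-safe version needs atomic reference counting and care around the detach step, which is part of why the performance benefit is debated.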
Constant strings make little sense with value semantics, and sharing isn't one of C++'s greatest strengths...
Would C++ benefit from immutable strings? Probably not much.
An immutable string is not the same as a read-only string. Immutability guarantees that no change to the observable state of the string may occur outside of what your own code can affect, to the point that if you take any code passing std::string by value, you could just replace it with such an immutable string pointer and everything will work (you cannot distinguish passing such a string by value from passing it by reference).
C++ guarantees this only when you create a const object right from the beginning. Adding const to an existing object does not make it immutable, and you can remove it via const_cast at any time without any issues. This basically means that you cannot make a type immutable in C++, only an object, such as a std::string named str that is declared const at its definition.
There are actually two guarantees at play here: you cannot modify str via const_cast<std::string&>(str) (undefined behaviour when modifying a const object), and you cannot modify the character data via const_cast<char*>(str.data()) (undefined behaviour for data() const). You could theoretically get const_cast<std::string&>(str).data(), and modifying that might be fine for a trivial std::string implementation, but the standard std::string might be optimized so that it stores the character data (up to a certain size) in the object itself, so you still risk modifying the object.
How do you pass an immutable object around? You can't. As soon as a function receives it as a const std::string& parameter str, this is no longer an immutable object but a read-only object: you cannot modify it through str, but it can still change at any time.
To have an actually immutable string, you need to encode this in the type system somehow. The best you can do is to wrap it in another object where it is immutable, like const std::tuple<const std::string> &str. There is no tuple variance that would permit getting this reference from a mutable std::tuple<std::string>, and so, when created, the string is already a const object.
Lastly, you also need guarantees about the lifetime of such an object, since even an immutable object can be deleted. That is thankfully not so complicated: std::shared_ptr<const std::tuple<const std::string>> gives you all the guarantees. Once you observe its value, it should stay constant forever. This is pretty much the basis of what languages with immutable strings give you.
Now, to reason about the answer, how would C++ benefit from using such a string? Are there any places where read-only strings are stored without giving you control over the specific type? There are not that many I could come up with:
- std::exception: constructing it from an error message needs to copy the string to preserve it, but with an immutable string pointer, it could just store that and return it in what(). However, you can create your own exception types that do that.
- std::locale: likewise, constructing from a locale name could just store the immutable string inside, without copying. However, locale names are not that frequently used, and are short enough for the small string optimization to kick in for most cases.
- std::messages: get() could be a primary target for immutable strings, like any other "string catalog": retrieval could happen relatively often, and the strings are moderately long and constant enough. However, nothing stops you from adding a caching layer to this object.
I encourage you to find more such examples. Indeed, there are many more cases where strings are accepted and processed, or generated and returned, than simply stored.
That being said, the situation in user code is drastically different: once you store user records or configurations, you become more in need of such a type. So while the C++ language itself might not benefit from it, your code definitely would! Such a type needs to:
- Be move-constructible from std::string, taking ownership of its character data.
- Be constructible from std::shared_ptr<std::array<const char, N>> or the aforementioned tuple.
- Be constructible from std::string, char iterators and the usual stuff, copying the data to its internal memory.
- Be constructible from things like std::shared_ptr<std::span<char>>, where immutability is not guaranteed by the type. This option needs to be clearly marked as dangerous, and modifying the underlying data in any way during the lifetime of the result should cause undefined behaviour.
- Interface nicely with std::shared_ptr-based code, preferably being handled as std::shared_ptr itself.
- Offer const char *data() const for retrieval. Note that sliced strings may be missing the '\0' character at the end, and so there needs to be a way to detect this having happened and return std::shared_ptr<const char*> instead (either aliased to the original data, or to a copy thereof with '\0' at the end).
This interplay between std::shared_ptr and std::string makes such a type definitely preferable to be defined by the standard rather than user code, as some situations could be handled much better without unnecessary allocations or indirection, and the immutability guarantee enables important optimizations.
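The difference between read-only and immutable, and the tuple wrapper described above, can be checked with a short sketch. The alias ImmutableString and the functions length_seen_through and make_immutable are hypothetical names for this example; the type traits and smart-pointer calls are standard C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical alias for the wrapper type named in the answer.
using ImmutableString = std::shared_ptr<const std::tuple<const std::string>>;

// A reference to const is only a read-only view: the referent may still
// change behind it, here through a second, non-const reference.
std::size_t length_seen_through(const std::string& view, std::string& owner) {
    owner += "!";        // mutate through the non-const path
    return view.size();  // the "const" view observes the change
}

// A pointer to a mutable string converts to a pointer to const string,
// so a const pointee alone does not prove the string was created const...
static_assert(std::is_convertible_v<std::shared_ptr<std::string>,
                                    std::shared_ptr<const std::string>>);

// ...whereas no conversion exists from a mutable tuple to the wrapper type.
static_assert(!std::is_convertible_v<std::shared_ptr<std::tuple<std::string>>,
                                     ImmutableString>);

// Hypothetical factory: the string is const from the moment it is created.
ImmutableString make_immutable(std::string s) {
    return std::make_shared<const std::tuple<const std::string>>(std::move(s));
}
```

Reading the value back takes std::get<0>(*ptr), which is clumsy enough to suggest why a dedicated type would be more pleasant than the tuple trick.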
On most new languages having immutable strings: strings are mutable in Ruby.
On thread safety: I would tend to forget safety arguments. If you want to be thread-safe, lock it, or don't touch it. C++ is not a convenient language; have your own conventions.
On security: No. As soon as you have pointer arithmetic and unprotected access to the address space, forget about being secure. Safer against innocently bad coding, yes.
On memory efficiency: Unless you implement CPU-intensive mechanisms, I don't see how.
On cheap substrings: That would be one very good point. It could be done by referring to a string with backreferences, where modifications to a string would cause a copy. Tokenizing and slicing become free; mutations become expensive.
C++ strings are thread safe. All immutable objects are guaranteed to be thread safe, but Java's StringBuffer is mutable like a C++ string, and both of them are thread safe. Why worry about speed? Define your method or function parameters with the const keyword to tell the compiler that the string will be immutable in that scope. Also, a string object could be made immutable on demand, deferring work until you absolutely need the string: in other words, when you append other strings to the main string, you keep a list of strings until you actually need the whole string, and they are joined together at that point.
To my knowledge, immutable and mutable objects operate at the same speed, except for their methods, which is a matter of pros and cons. Constant primitives and variable primitives move at different speeds because, at the machine level, variables are assigned to a register or a memory location, which requires a few binary operations, while constants are labels that don't require any of those and are thus faster (or less work is done). This applies only to primitives and not to objects.
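The "list of strings joined on demand" idea from this answer could look roughly like the sketch below. The class name lazy_concat and its members are assumptions for illustration, and this makes no claim about how any existing library does it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical builder: appends are recorded as pieces and joined only
// when the whole string is actually requested.
class lazy_concat {
public:
    void append(std::string piece) {
        total_ += piece.size();               // remember the final length
        pieces_.push_back(std::move(piece));  // keep the piece, no joining yet
    }

    // Joins all pieces with a single allocation for the result.
    std::string str() const {
        std::string out;
        out.reserve(total_);
        for (const std::string& p : pieces_) out += p;
        return out;
    }

    std::size_t piece_count() const { return pieces_.size(); }

private:
    std::vector<std::string> pieces_;
    std::size_t total_ = 0;
};
```

Whether this beats appending to a single std::string depends on the workload, so it is best treated as something to measure rather than assume.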