String immutability in C#

Posted on 2024-09-16 01:48:17

I was curious how the StringBuilder class is implemented internally, so I decided to check out Mono's source code and compare it with Reflector's disassembly of Microsoft's implementation. Essentially, Microsoft's implementation uses a char[] to store the string representation internally, and a bunch of unsafe methods to manipulate it. This is straightforward and did not raise any questions. But I was confused when I found that Mono uses a string inside StringBuilder:

private int _length;
private string _str;

My first thought was: "What a senseless StringBuilder". But then I figured out that it is possible to mutate a string using pointers:

public StringBuilder Append (string value) 
{
     // ...
     String.CharCopy (_str, _length, value, 0, value.Length);
}

internal static unsafe void CharCopy (char *dest, char *src, int count) 
{
    // ...
    ((short*)dest) [0] = ((short*)src) [0]; dest++; src++;
}

I used to program in C/C++ a little, so I can't say that this code confused me much, but I thought that strings were completely immutable (i.e., that there is absolutely no way to mutate them). So the actual questions are:

Can I create a completely immutable type?
Is there any reason, apart from performance concerns, to use such code (unsafe code that changes immutable types)?
Are strings then inherently thread-safe or not?

記憶穿過時間隧道 2024-09-23 01:48:17

Can I create a completely immutable type?

You can create a type where the CLR enforces immutability on it. You can then use "unsafe" to turn off the CLR enforcement mechanisms. That's why "unsafe" is called "unsafe" - because it turns off the safety system. In unsafe code every single byte of memory in the process can be writable if you try hard enough, including both the immutable bytes and the code in the CLR which enforces immutability.
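
To make "turning off the safety system" concrete, here is a minimal sketch. The names UnsafeStringDemo and OverwriteFirstChar are invented for illustration, and the code assumes the project is compiled with unsafe code enabled (AllowUnsafeBlocks). It works on a freshly allocated string rather than a literal, because literals may be interned and shared with unrelated code.

```csharp
using System;

public static class UnsafeStringDemo
{
    // Hypothetical helper: overwrites the first character of an existing
    // string in place, breaking the immutability contract of System.String.
    public static unsafe void OverwriteFirstChar(string s, char replacement)
    {
        if (string.IsNullOrEmpty(s))
            throw new ArgumentException("String must be non-empty.", nameof(s));

        // 'fixed' pins the string and yields a pointer to its first character.
        fixed (char* p = s)
        {
            p[0] = replacement;
        }
    }

    public static void Main()
    {
        string s = new string('a', 3);   // fresh instance, not a shared literal
        string alias = s;                // second reference to the same object
        OverwriteFirstChar(s, 'b');
        Console.WriteLine(alias);        // the alias observes the mutation too
    }
}
```

Every reference to that string object sees the change, which is exactly why code like this should never touch a string that other code can reach.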

You can also use Reflection to break immutability. Both Reflection and unsafe code require an extremely high level of trust to be granted.
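
A sketch of the reflection route, under the assumption that the code runs with enough trust to access private members. Celsius and ReflectionDemo are hypothetical names made up for this example. The field is readonly and has no setter, yet FieldInfo.SetValue can still overwrite an instance field like this on the runtimes I am aware of.

```csharp
using System;
using System.Reflection;

// A type whose immutability the compiler normally enforces:
// the field is readonly and there is no setter.
public sealed class Celsius
{
    private readonly double _degrees;
    public Celsius(double degrees) { _degrees = degrees; }
    public double Degrees => _degrees;
}

public static class ReflectionDemo
{
    // Hypothetical helper: overwrites the private readonly field via reflection.
    public static void ForceDegrees(Celsius target, double value)
    {
        FieldInfo field = typeof(Celsius).GetField(
                "_degrees", BindingFlags.Instance | BindingFlags.NonPublic)
            ?? throw new MissingFieldException(nameof(Celsius), "_degrees");
        field.SetValue(target, value);
    }

    public static void Main()
    {
        var temperature = new Celsius(20.0);
        ForceDegrees(temperature, 99.0);
        Console.WriteLine(temperature.Degrees); // no longer the constructed value
    }
}
```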

Is there any reason to use such code apart from performance concerns?

Sure, there are lots of reasons to use immutable data structures. Immutable data structures rock. Some good reasons to use immutable data structures:

Immutable data structures are easier to reason about than mutable data structures. When you ask "is this list empty?" and you get an answer, you know that the answer is correct not just now, but forever. With mutable data structures you cannot actually ask "is this list empty?" All you can ask is "is this list empty right now?", and by the time you act on the answer, it only tells you "was this list empty at some point in the past?"

The fact that the answer to a question about an immutable type stays true forever has security implications. Suppose you have code like this:

void Frob(Bar bar)
{
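    // Check first; the exception type here is only a placeholder.
    if (!IsSafe(bar)) throw new ArgumentException("bar is not safe", nameof(bar));
    // Use second; nothing guarantees bar is unchanged since the check.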
    DoSomethingDangerous(bar);
}

If Bar is a mutable type then there is a race condition here; bar could be made unsafe on another thread after the check but before something dangerous happens. If Bar is an immutable type then the answer to the question stays the same throughout, which is much safer. (Imagine if you could mutate a string containing a path after the security check but before the file was opened, for example.)
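
The check-then-use window can be sketched deterministically. Everything below is an assumption made for illustration: MutableBar, ImmutableBar, the toy IsSafe rule, and the interleaved callback, which stands in for whatever another thread might do between the check and the use.

```csharp
using System;

public sealed class MutableBar
{
    public string Path { get; set; }
    public MutableBar(string path) { Path = path; }
}

public sealed class ImmutableBar
{
    public string Path { get; }   // no setter: fixed at construction
    public ImmutableBar(string path) { Path = path; }
}

public static class FrobDemo
{
    // Stand-in safety check, invented for this sketch.
    public static bool IsSafe(string path) => !path.Contains("..");

    // 'interleaved' simulates another thread running in the window
    // between the check and the use.
    public static string FrobMutable(MutableBar bar, Action interleaved)
    {
        if (!IsSafe(bar.Path)) throw new ArgumentException("Unsafe path.", nameof(bar));
        interleaved();
        return bar.Path;   // may differ from the value that passed the check
    }

    public static string FrobImmutable(ImmutableBar bar, Action interleaved)
    {
        if (!IsSafe(bar.Path)) throw new ArgumentException("Unsafe path.", nameof(bar));
        interleaved();
        return bar.Path;   // still the value that passed the check
    }
}
```

Calling FrobMutable with a callback that reassigns bar.Path to something like "../secret.txt" returns the unchecked path, while no callback written in safe code can change what FrobImmutable returns.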

Methods which take immutable data structures as their arguments and return them as their results and perform no side effects are called "pure methods". Pure methods can be memoized, which trades increased memory use for increased speed, often enormously increased speed.

Immutable data structures can often be used on multiple threads simultaneously without locking. Locking is there to prevent creation of inconsistent state of an object in the face of a mutation, but immutable objects don't have mutations. (Some so-called immutable data structures are logically immutable but actually do mutations inside themselves; imagine for example a lookup table which does not change its contents, but does reorganize its internal structure if it can deduce what the next query is likely to be. Such a data structure would not be automatically threadsafe.)

Immutable data structures that efficiently re-use their internal parts when a new structure is built from an old one make it easy to "take a snapshot" of the state of a program without wasting lots of memory. That makes undo-redo operations trivial to implement. It makes it easier to write debugging tools that can show you how you got to a particular program state.

And so on.
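
The point about re-using internal parts can be illustrated with a minimal persistent stack. This is only a sketch; PersistentStack is a name invented here, not a framework type. Push allocates one new node that points at the existing stack, so the old stack remains a valid snapshot and "undo" is just holding on to the earlier reference.

```csharp
using System;

// Minimal persistent (immutable) stack, written for illustration.
public sealed class PersistentStack<T>
{
    // The empty stack is the only instance whose tail is null.
    public static readonly PersistentStack<T> Empty =
        new PersistentStack<T>(default(T), null);

    private readonly T _head;
    private readonly PersistentStack<T> _tail;

    private PersistentStack(T head, PersistentStack<T> tail)
    {
        _head = head;
        _tail = tail;
    }

    public bool IsEmpty => _tail == null;

    // Allocates a single node; the existing stack is shared, not copied.
    public PersistentStack<T> Push(T value) => new PersistentStack<T>(value, this);

    public T Peek() =>
        IsEmpty ? throw new InvalidOperationException("Stack is empty.") : _head;

    // Returns the previous snapshot; nothing is mutated.
    public PersistentStack<T> Pop() =>
        IsEmpty ? throw new InvalidOperationException("Stack is empty.") : _tail;
}
```

For example, after var s1 = PersistentStack<string>.Empty.Push("a"); and var s2 = s1.Push("b");, s1 still holds only "a", and s2.Pop() hands back the very same object as s1. Because no node ever changes after construction, both snapshots can be read from several threads without locks.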

Are strings then inherently thread-safe or not?

If everyone plays by the rules, they are. If someone uses unsafe code or private reflection then there is no rule enforcement anymore. You have to trust that if someone is using high-privilege code then they are doing so correctly and not mutating a string. Use your power to run unsafe code only for good; with great power comes great responsibility.

So do I need to use locks or not?

That is a strange question. Remember, locks are co-operative. Locks only work if everyone accessing a particular object agrees upon the locking strategy that must be used.

You have to use locks if the agreed-upon locking strategy for accessing a particular object in a particular storage location is to use locks. If that isn't the agreed-upon locking strategy then using locks is pointless; you're carefully locking and unlocking the front door while someone else is walking in the open back door.
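
One common way to make the strategy "agreed upon" is to hide both the state and the lock inside a type, so every access has to go through the same door. The sketch below assumes that design; SharedCounter and LockDemo are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

public sealed class SharedCounter
{
    // Private lock object: callers cannot bypass it in safe code.
    private readonly object _gate = new object();
    private int _value;

    public void Increment()
    {
        lock (_gate) { _value++; }
    }

    public int Read()
    {
        lock (_gate) { return _value; }
    }
}

public static class LockDemo
{
    // Runs 'iterations' increments across worker threads and returns the total.
    public static int CountInParallel(int iterations)
    {
        var counter = new SharedCounter();
        Parallel.For(0, iterations, _ => counter.Increment());
        return counter.Read();
    }
}
```

Because every increment takes the same lock, no update is lost. If one caller were able to write _value without taking _gate, the locking done by everyone else would stop guaranteeing anything, which is the open back door described above.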

If you have a string which you know is being mutated by unsafe code, and you don't want to see inconsistent partial mutations, and the code which is doing the unsafe mutation documents that it takes out a particular lock during that mutation, then yes, you need to use locks when accessing that string. But this situation is very rare; ideally no one would use unsafe code to manipulate a string accessible by other code on another thread, because doing so is an incredibly bad idea. That's why we require that code that does so is fully trusted. And that's why we require that the C# source code for such a function wave a big red flag that says "this code is unsafe, review it carefully!"
