If strings are immutable in .NET, then why does Substring take O(n) time?
Given that strings are immutable in .NET, I'm wondering why they have been designed such that string.Substring() takes O(substring.Length) time, instead of O(1)?
i.e. what were the tradeoffs, if any?
UPDATE: I liked this question so much, I just blogged it. See Strings, immutability and persistence
The short answer is: O(n) is O(1) if n does not grow large. Most people extract tiny substrings from tiny strings, so how the complexity grows asymptotically is completely irrelevant.
The long answer is:
An immutable data structure built such that operations on an instance permit re-use of the memory of the original with only a small amount (typically O(1) or O(lg n)) of copying or new allocation is called a "persistent" immutable data structure. Strings in .NET are immutable; your question is essentially "why are they not persistent"?
Because when you look at operations that are typically done on strings in .NET programs, it is in every relevant way hardly worse at all to simply make an entirely new string. The expense and difficulty of building a complex persistent data structure doesn't pay for itself.
People typically use "substring" to extract a short string -- say, ten or twenty characters -- out of a somewhat longer string -- maybe a couple hundred characters. You have a line of text in a comma-separated file and you want to extract the third field, which is a last name. The line will be maybe a couple hundred characters long, the name will be a couple dozen. String allocation and memory copying of fifty bytes is astonishingly fast on modern hardware. That making a new data structure that consists of a pointer to the middle of an existing string plus a length is also astonishingly fast is irrelevant; "fast enough" is by definition fast enough.
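The comma-separated scenario above can be sketched as follows. This is only an illustration: the sample line, its field layout, and the helper name ThirdField are assumptions made up for the example, not anything from the original answer.

```csharp
using System;

// Hypothetical helper: returns the third comma-separated field of a line.
// Substring allocates a new string and copies just the field's characters.
static string ThirdField(string csvLine)
{
    int first = csvLine.IndexOf(',');
    if (first < 0) throw new FormatException("expected at least three fields");
    int second = csvLine.IndexOf(',', first + 1);
    if (second < 0) throw new FormatException("expected at least three fields");

    int third = csvLine.IndexOf(',', second + 1);
    int end = third < 0 ? csvLine.Length : third;   // third field may be the last one

    return csvLine.Substring(second + 1, end - second - 1);
}

// Made-up sample data: id, first name, last name, city, date.
string line = "1042,Ada,Lovelace,London,1815-12-10";
Console.WriteLine(ThirdField(line));   // prints "Lovelace"
```

The copy here is eight characters out of a line of a few dozen, which is the small, short-lived allocation the answer is describing.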
The substrings extracted are typically small in size and short in lifetime; the garbage collector is going to reclaim them soon, and they didn't take up much room on the heap in the first place. So using a persistent strategy that encourages reuse of most of the memory is also not a win; all you've done is made your garbage collector get slower because now it has to worry about handling interior pointers.
If the substring operations people typically did on strings were completely different, then it would make sense to go with a persistent approach. If people typically had million-character strings, and were extracting thousands of overlapping substrings with sizes in the hundred-thousand-character range, and those substrings lived a long time on the heap, then it would make perfect sense to go with a persistent substring approach; it would be wasteful and foolish not to. But most line-of-business programmers do not do anything even vaguely like those sorts of things. .NET is not a platform that is tailored for the needs of the Human Genome Project; DNA analysis programmers have to solve problems with those string usage characteristics every day; odds are good that you do not. The few who do build their own persistent data structures that closely match their usage scenarios.
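For contrast, a persistent substring of the "pointer plus length" kind might look roughly like the sketch below. It models a view as a (Source, Start, Length) tuple; the helper names View, SubView and Materialize are hypothetical, and this is not how System.String works.

```csharp
using System;

// O(1): records where the substring lives instead of copying it.
static (string Source, int Start, int Length) View(string source, int start, int length)
{
    if (start < 0 || length < 0 || start > source.Length - length)
        throw new ArgumentOutOfRangeException(nameof(start));
    return (source, start, length);
}

// O(1): a view of a view still points into the original string.
static (string Source, int Start, int Length) SubView(
    (string Source, int Start, int Length) v, int start, int length)
{
    if (start < 0 || length < 0 || start > v.Length - length)
        throw new ArgumentOutOfRangeException(nameof(start));
    return (v.Source, v.Start + start, length);
}

// O(length): characters are copied only when a real string is required.
static string Materialize((string Source, int Start, int Length) v) =>
    v.Source.Substring(v.Start, v.Length);

string genome = new string('A', 1_000_000);   // stand-in for a very large string
var big = View(genome, 100_000, 100_000);     // no copying
var small = SubView(big, 10, 5);              // still no copying
Console.WriteLine(Materialize(small));        // prints "AAAAA"
```

Every view keeps the whole million-character string reachable, which only pays off for the large, long-lived, overlapping substrings described above.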
For example, my team writes programs that do on-the-fly analysis of C# and VB code as you type it. Some of those code files are enormous and thus we cannot be doing O(n) string manipulation to extract substrings or insert or delete characters. We have built a bunch of persistent immutable data structures for representing edits to a text buffer that permit us to quickly and efficiently re-use the bulk of the existing string data and the existing lexical and syntactic analyses upon a typical edit. This was a hard problem to solve and its solution was narrowly tailored to the specific domain of C# and VB code editing. It would be unrealistic to expect the built-in string type to solve this problem for us.
Precisely because strings are immutable, .Substring must make a copy of at least a portion of the original string. Making a copy of n bytes should take O(n) time. How do you think you would copy a bunch of bytes in constant time?
EDIT: Mehrdad suggests not copying the string at all, but keeping a reference to a piece of it.
Consider, in .NET, a multi-megabyte string on which someone calls .Substring(n, 4) (for any n in the middle of the string). Now the ENTIRE string cannot be garbage collected, just because one reference is holding on to 4 characters? That seems like a ridiculous waste of space.
Further, tracking references to substrings (which may even be inside other substrings), and trying to copy at optimal times to avoid defeating the GC (as described above), makes the concept a nightmare. It is far simpler, and more reliable, to copy on .Substring and maintain the straightforward immutable model.
EDIT: Here's a good little read about the danger of keeping references to substrings within larger strings.
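The retention concern can be observed with the non-copying slice types that .NET offers alongside string. The sketch below is a minimal illustration using ReadOnlyMemory<char> and MemoryMarshal.TryGetString; the 4,000,000-character size is an arbitrary choice for the example.

```csharp
using System;
using System.Runtime.InteropServices;

string big = new string('x', 4_000_000);          // roughly 8 MB of UTF-16 data
int n = big.Length / 2;

string copy = big.Substring(n, 4);                // independent 4-character string
ReadOnlyMemory<char> view = big.AsMemory(n, 4);   // 4-character window onto big

// Ask the runtime which string object backs the window.
bool backed = MemoryMarshal.TryGetString(view, out var owner, out int offset, out int count);

Console.WriteLine(copy.Length);                            // 4
Console.WriteLine(backed && ReferenceEquals(owner, big));  // True
Console.WriteLine(owner?.Length);                          // 4000000
```

As long as view is reachable, so is the whole of big; copy has no such tie to the original.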
Java (as opposed to .NET) provides two ways of doing Substring(): you can choose whether you want to keep just a reference or copy the whole substring to a new memory location.
The simple .substring(...) shares the internally used char array with the original String object, which you can then copy to a new array with new String(...) if needed (to avoid hindering garbage collection of the original one).
I think this kind of flexibility is the best option for a developer.
Java used to reference larger strings, but it changed its behavior to copying as well, to avoid leaking memory.
I feel like it can be improved though: why not just do the copying conditionally?
If the substring is at least half the size of the parent, one can reference the parent. Otherwise one can just make a copy. This avoids leaking a lot of memory while still providing a significant benefit.
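The conditional idea could be sketched in C# on top of ReadOnlyMemory<char>, since System.String itself cannot share storage. The names ConditionalSubstring and SharesStorageWith are hypothetical, and the one-half threshold is the one proposed above.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: reference the parent for large slices, copy small ones.
static ReadOnlyMemory<char> ConditionalSubstring(string parent, int start, int length)
{
    // At least half of the parent: sharing wastes at most half of the memory.
    if (2L * length >= parent.Length)
        return parent.AsMemory(start, length);

    // Small slice: copy it so the parent can be collected independently.
    return parent.Substring(start, length).AsMemory();
}

// Reports whether a slice is backed by the given string object.
static bool SharesStorageWith(ReadOnlyMemory<char> slice, string candidate) =>
    MemoryMarshal.TryGetString(slice, out var owner, out _, out _)
    && ReferenceEquals(owner, candidate);

string text = new string('x', 1000);
Console.WriteLine(SharesStorageWith(ConditionalSubstring(text, 0, 600), text));  // True
Console.WriteLine(SharesStorageWith(ConditionalSubstring(text, 0, 10), text));   // False
```

With this threshold a retained slice never pins more than twice its own size, at the cost of callers not knowing in advance whether a call copies.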
None of the answers here addressed "the bracketing problem", which is to say that strings in .NET are represented as a combination of a BStr (the length stored in memory "before" the pointer) and a CStr (the string ends in a '\0').
The string "Hello there" is thus represented as a length prefix (11), followed by the characters themselves, followed by a terminating '\0'. (If assigned to a char* in a fixed statement, the pointer would point to the 0x48, the 'H'.)
This structure allows for fast lookup of the length of a string (useful in many contexts) and allows for the pointer to be passed in a P/Invoke to Win32 (or other) APIs which expect a null-terminated string.
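A simplified picture of that layout can be printed with the sketch below. It is a model built from the description above, not a dump of real object memory: an actual .NET string object also carries an object header and a type pointer in front of the length, and the helper name SimplifiedLayout is made up for the example.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical model: length prefix + UTF-16 code units + null terminator.
static byte[] SimplifiedLayout(string s) =>
    BitConverter.GetBytes(s.Length)             // 4-byte length (the "BStr" part)
        .Concat(Encoding.Unicode.GetBytes(s))   // UTF-16LE characters; a char* would point here
        .Concat(new byte[] { 0, 0 })            // terminating '\0' (the "CStr" part)
        .ToArray();

// Prints the bytes as hex pairs; the first character 'H' appears as 48-00.
Console.WriteLine(BitConverter.ToString(SimplifiedLayout("Hello there")));
```

A substring taken from the middle has nowhere to keep its own length prefix and terminator inside this block, which is why a copy is needed.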
When you do Substring(0, 5), the "oh, but I promised there'd be a null character after the last character" rule says you need to make a copy. Even if you took the substring at the end, there'd be no place to put the length without corrupting the other variables.
Sometimes, though, you really do want to talk about "the middle of the string", and you don't necessarily care about the P/Invoke behavior. The recently added ReadOnlySpan<T> structure can be used to get a no-copy substring, for example with s.AsSpan(0, 5).
The ReadOnlySpan<char> "substring" stores the length independently, and it does not guarantee that there's a '\0' after the end of the value. It can be used in many ways "like a string", but it is not "a string" since it doesn't have either BStr or CStr characteristics (much less both of them). If you never (directly) P/Invoke then there's not much of a difference (unless the API you want to call doesn't have a ReadOnlySpan<char> overload).
ReadOnlySpan<char> cannot be used as the field of a reference type, so there's also ReadOnlyMemory<char> (s.AsMemory(0, 5)), which is an indirect way of having a ReadOnlySpan<char>, so the same differences from string exist.
Some of the answers/comments on previous answers talked about it being wasteful to have the garbage collector have to keep a million-character string around while you continue to talk about 5 characters. That is precisely the behavior you can get with the ReadOnlySpan<char> approach. If you're just doing short computations, the ReadOnlySpan approach is probably better. If you need to persist it for a while and you're going to keep only a small percentage of the original string, doing a proper substring (to trim off the excess data) is probably better. There's a transition point somewhere in the middle, but it depends on your specific usage.
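A minimal sketch of the no-copy approach follows, using the AsSpan and AsMemory extension methods on string. The sample strings and variable names are chosen for illustration only.

```csharp
using System;

string s = "Hello there";

// No-copy "substrings": each is a reference into s plus a length.
ReadOnlySpan<char> hello = s.AsSpan(0, 5);        // stack-only, for short computations
ReadOnlyMemory<char> stored = s.AsMemory(0, 5);   // can be kept in a field of a class

// Many APIs accept spans directly, so no intermediate string is needed.
string record = "id=12345;name=x";
int id = int.Parse(record.AsSpan(3, 5));          // parses "12345" in place

// A real string is created only when asked for, and that step copies.
string copy = hello.ToString();

Console.WriteLine(copy);                  // Hello
Console.WriteLine(id);                    // 12345
Console.WriteLine(stored.Span.Length);    // 5
```

Both hello and stored keep s reachable for as long as they are in use, so for long-lived values that cover a small part of a large string, calling Substring (or ToString on the span) to trim off the excess is still the safer choice.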