I have a class that generates DNA sequences, which are represented by long strings. This class implements the IEnumerable<string> interface, and it can produce an infinite number of DNA sequences. Below is a simplified version of my class:
class DnaGenerator : IEnumerable<string>
{
    private readonly IEnumerable<string> _enumerable;

    public DnaGenerator() => _enumerable = Iterator();

    private IEnumerable<string> Iterator()
    {
        while (true)
            foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
                yield return new String(c, 10_000_000);
    }

    public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
This class generates the DNA sequences by using an iterator. Instead of invoking the iterator again and again, an IEnumerable<string> instance is created during construction and is cached as a private field. The problem is that using this class results in a sizable chunk of memory being constantly allocated, with the garbage collector being unable to recycle this chunk. Here is a minimal demonstration of this behavior:
var dnaGenerator = new DnaGenerator();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
DoWork(dnaGenerator);
GC.Collect();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
GC.KeepAlive(dnaGenerator);

static void DoWork(DnaGenerator dnaGenerator)
{
    foreach (string dna in dnaGenerator.Take(5))
    {
        Console.WriteLine($"Processing DNA of {dna.Length:#,0} nucleotides" +
            $", starting from {dna[0]}");
    }
}
Output:
TotalMemory: 84,704 bytes
Processing DNA of 10,000,000 nucleotides, starting from A
Processing DNA of 10,000,000 nucleotides, starting from C
Processing DNA of 10,000,000 nucleotides, starting from G
Processing DNA of 10,000,000 nucleotides, starting from T
Processing DNA of 10,000,000 nucleotides, starting from A
TotalMemory: 20,112,680 bytes
Try it on Fiddle.
My expectation was that all generated DNA sequences would be eligible for garbage collection, since they are not referenced by my program. The only reference that I hold is the reference to the DnaGenerator instance itself, which is not meant to contain any sequences. This component just generates the sequences. Nevertheless, no matter how many or how few sequences my program generates, there are always around 20 MB of memory allocated after a full garbage collection.
My question is: Why is this happening? And how can I prevent this from happening?
.NET 6.0, Windows 10, 64-bit operating system, x64-based processor, Release build.
Update: The problem disappears if I replace this:
public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
...with this:
public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();
But I am not a fan of creating a new enumerable each time an enumerator is needed. My understanding is that a single IEnumerable<T> can create many IEnumerator<T>s. AFAIK these two interfaces are not meant to have a one-to-one relationship.

The problem is caused by the auto-generated implementation for the code using yield. You can mitigate this somewhat by explicitly implementing the enumerator. You have to fiddle with it a bit by calling .Reset() from public IEnumerator<string> GetEnumerator() to ensure the enumeration restarts at each call.
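
The code that accompanied this answer is not preserved here. The following is a minimal sketch of what an explicitly implemented enumerator with a Reset() call could look like; it is not the answerer's original code. The nested DnaEnumerator class, the length constructor parameter, and the choice to clear the current string in Dispose() are all assumptions made for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical sketch: a DnaGenerator backed by a hand-written enumerator.
class DnaGenerator : IEnumerable<string>
{
    private readonly DnaEnumerator _enumerator;

    // The length parameter is an assumption, added so small sequences can be tested.
    public DnaGenerator(int length = 10_000_000) => _enumerator = new DnaEnumerator(length);

    public IEnumerator<string> GetEnumerator()
    {
        _enumerator.Reset(); // restart the sequence on every call
        return _enumerator;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    private sealed class DnaEnumerator : IEnumerator<string>
    {
        private static readonly char[] Nucleotides = { 'A', 'C', 'G', 'T' };
        private readonly int _length;
        private int _index;
        private string _current = "";

        public DnaEnumerator(int length) => _length = length;

        public string Current => _current;
        object IEnumerator.Current => _current;

        public bool MoveNext()
        {
            _current = new string(Nucleotides[_index], _length);
            _index = (_index + 1) % Nucleotides.Length;
            return true; // the sequence is infinite
        }

        public void Reset()
        {
            _index = 0;
            _current = "";
        }

        // Drop the last generated string so the cached enumerator does not keep it alive.
        public void Dispose() => _current = "";
    }
}
```

Because a single enumerator instance is shared, this sketch does not support nested or concurrent enumerations of the same DnaGenerator; that trade-off is what keeps it from allocating a new enumerator per call.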

Note that 10_000_000 chars (16 bits each) take approximately 20 MB. If you take a look at the decompilation, you will notice that yield return results in an internal <Iterator> class being generated, which in turn has a current field to store the string (to implement IEnumerator<string>.Current). The Iterator method is compiled internally into code that assigns each yielded string to that field. This leads to the current string always being kept in memory by the _enumerable.GetEnumerator() implementation (after iteration starts), for as long as the DnaGenerator instance itself is not collected.

UPD

Yes, an enumerable generated for yield return can create multiple enumerators, but in this particular case the implementation has a "one-to-one" relationship, because the generated class is both an IEnumerable and an IEnumerator. That is actually what happens when you call _enumerable.GetEnumerator() (which is obviously an implementation detail): if you check the decompilation mentioned above, you will see that _enumerable = Iterator() is actually new <Iterator>d__2(-2), and <Iterator>d__2.GetEnumerator() hands out that same instance for the first enumeration. So it should create a new iterator instance every time except for the first enumeration, which means your public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator(); approach is just fine.
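
The decompiled listings referred to in this answer are not preserved here. As a rough illustration only, the class below is a hand-written approximation of the kind of state machine the compiler generates for an iterator method. The name IteratorSketch, the field names, and the length parameter are assumptions; the real generated class uses compiler-mangled names such as <Iterator>d__2 and differs in detail.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hand-written approximation of a compiler-generated iterator class.
// One object plays both roles: IEnumerable<string> and IEnumerator<string>.
sealed class IteratorSketch : IEnumerable<string>, IEnumerator<string>
{
    private static readonly char[] Nucleotides = { 'A', 'C', 'G', 'T' };
    private readonly int _length;
    private readonly int _initialThreadId = Environment.CurrentManagedThreadId;
    private int _state;            // -2 means "created as an enumerable, not yet enumerated"
    private int _index;
    private string _current = "";  // the last yielded string stays reachable through this field

    public IteratorSketch(int state, int length)
    {
        _state = state;
        _length = length;
    }

    public string Current => _current;
    object IEnumerator.Current => _current;

    public IEnumerator<string> GetEnumerator()
    {
        // The first enumeration on the creating thread reuses this very object.
        if (_state == -2 && _initialThreadId == Environment.CurrentManagedThreadId)
        {
            _state = 0;
            return this;
        }
        // Every later enumeration gets a fresh instance.
        return new IteratorSketch(0, _length);
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public bool MoveNext()
    {
        _current = new string(Nucleotides[_index], _length);
        _index = (_index + 1) % Nucleotides.Length;
        return true;
    }

    public void Dispose() { }      // nothing here clears _current
    public void Reset() => throw new NotSupportedException();
}
```

With this shape, caching the enumerable in a field also caches the first enumerator, together with whatever string it yielded last. At 2 bytes per char, one string of 10,000,000 chars accounts for about 20,000,000 bytes, which is in line with the roughly 20 MB reported in the question.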

If memory usage (or speed) is a concern, you might also want to use bytes (or ints) to represent 4 nucleotides at once. Given what you shared with us, that might be the case.
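
As a sketch of that idea, assuming the alphabet is limited to A, C, G and T, each nucleotide fits in 2 bits, so 4 of them fit in one byte. The NucleotidePacking class and its method names below are hypothetical and only illustrate the encoding.

```csharp
using System;

// Hypothetical helper: stores 4 nucleotides per byte, 2 bits each.
static class NucleotidePacking
{
    private const string Alphabet = "ACGT"; // position in this string = 2-bit code

    public static byte[] Pack(string dna)
    {
        var packed = new byte[(dna.Length + 3) / 4]; // round up to whole bytes
        for (int i = 0; i < dna.Length; i++)
        {
            int code = Alphabet.IndexOf(dna[i]);
            if (code < 0)
                throw new ArgumentException($"Unexpected nucleotide '{dna[i]}'.", nameof(dna));
            packed[i / 4] |= (byte)(code << ((i % 4) * 2));
        }
        return packed;
    }

    // Reads back the nucleotide at the given position.
    public static char Get(byte[] packed, int index)
        => Alphabet[(packed[index / 4] >> ((index % 4) * 2)) & 0b11];
}
```

The sequence length has to be tracked separately, because the last byte may be only partially filled. Compared with a string at 2 bytes per char, this representation needs one eighth of the memory.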

@GuruStron's answer demonstrated that the problem I presented here was created by my shallow understanding of C# iterators and of how they are implemented internally. By storing an IEnumerable<string> in my DnaGenerator instances, I am gaining essentially nothing. When an enumerator is requested, both GetEnumerator variants from the question (the one that uses the cached _enumerable and the one that calls Iterator()) result in allocating a single object. It's an autogenerated object with a dual personality: it is both an IEnumerable<string> and an IEnumerator<string>. By storing the _enumerable in a field, I am just preventing this object from getting recycled.

Nevertheless, I am still searching for ways to solve this non-issue in a way that would allow me to keep the cached _enumerable field without causing a memory leak, and without resorting to implementing a full-fledged IEnumerable<string> from scratch, as shown in @MatthewWatson's answer. The workaround that I found is to wrap my generated DNA sequences in StrongBox<string> wrappers, and then to Unwrap the iterator, through an Unwrap extension method, before exposing it to the external world. The trick is to keep track of the latest StrongBox<T> that has been emitted by the enumerator, and to set its Value to default when the enumerator is disposed. Live demo.
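
The code for this workaround is not preserved here. Below is a minimal sketch of how such an Unwrap extension method might be written; it is a reconstruction under assumptions, not the author's original code. The StrongBoxExtensions, Unwrapper and Enumerator names are hypothetical; StrongBox<T> is the real type from System.Runtime.CompilerServices.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

static class StrongBoxExtensions
{
    // Exposes a sequence of boxes as a sequence of their values.
    public static IEnumerable<T> Unwrap<T>(this IEnumerable<StrongBox<T>> source)
        => new Unwrapper<T>(source);

    private sealed class Unwrapper<T> : IEnumerable<T>
    {
        private readonly IEnumerable<StrongBox<T>> _source;

        public Unwrapper(IEnumerable<StrongBox<T>> source) => _source = source;

        public IEnumerator<T> GetEnumerator() => new Enumerator(_source.GetEnumerator());
        IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

        private sealed class Enumerator : IEnumerator<T>
        {
            private readonly IEnumerator<StrongBox<T>> _inner;
            private StrongBox<T> _latest = new StrongBox<T>(); // latest box emitted so far

            public Enumerator(IEnumerator<StrongBox<T>> inner) => _inner = inner;

            public T Current => _latest.Value!;
            object IEnumerator.Current => Current!;

            public bool MoveNext()
            {
                if (!_inner.MoveNext()) return false;
                _latest = _inner.Current;
                return true;
            }

            public void Reset() => _inner.Reset();

            public void Dispose()
            {
                // The cached iterator may still reference the box,
                // but the box no longer references the large value.
                _latest.Value = default!;
                _inner.Dispose();
            }
        }
    }
}
```

Under these assumptions, Iterator() would yield new StrongBox<string>(...) values and GetEnumerator() would return _enumerable.Unwrap().GetEnumerator(). After disposal, the cached iterator holds on to an empty box only, so the 20 MB string becomes eligible for collection.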