Managed memory leaked by C# iterator

Posted on 2025-02-04 07:00:17, 2982 characters, 1 view, 0 comments

I have a class that generates DNA sequences, that are represented by long strings. This class implements the IEnumerable<string> interface, and it can produce an infinite number of DNA sequences. Below is a simplified version of my class:

class DnaGenerator : IEnumerable<string>
{
    private readonly IEnumerable<string> _enumerable;

    public DnaGenerator() => _enumerable = Iterator();

    private IEnumerable<string> Iterator()
    {
        while (true)
            foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
                yield return new String(c, 10_000_000);
    }

    public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

This class generates the DNA sequences by using an iterator. Instead of invoking the iterator again and again, an IEnumerable<string> instance is created during the construction and is cached as a private field. The problem is that using this class results in a sizable chunk of memory being constantly allocated, with the garbage collector being unable to recycle this chunk. Here is a minimal demonstration of this behavior:

var dnaGenerator = new DnaGenerator();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
DoWork(dnaGenerator);
GC.Collect();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
GC.KeepAlive(dnaGenerator);

static void DoWork(DnaGenerator dnaGenerator)
{
    foreach (string dna in dnaGenerator.Take(5))
    {
        Console.WriteLine($"Processing DNA of {dna.Length:#,0} nucleotides" +
            $", starting from {dna[0]}");
    }
}

Output:

TotalMemory: 84,704 bytes
Processing DNA of 10,000,000 nucleotides, starting from A
Processing DNA of 10,000,000 nucleotides, starting from C
Processing DNA of 10,000,000 nucleotides, starting from G
Processing DNA of 10,000,000 nucleotides, starting from T
Processing DNA of 10,000,000 nucleotides, starting from A
TotalMemory: 20,112,680 bytes

Try it on Fiddle.

My expectation was that all generated DNA sequences would be eligible for garbage collection, since they are not referenced by my program. The only reference that I hold is the reference to the DnaGenerator instance itself, which is not meant to contain any sequences. This component just generates the sequences. Nevertheless, no matter how many or how few sequences my program generates, there are always around 20 MB of memory allocated after a full garbage collection.

My question is: Why is this happening? And how can I prevent this from happening?

.NET 6.0, Windows 10, 64-bit operating system, x64-based processor, Release build.


Update: The problem disappears if I replace this:

public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();

...with this:

public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();

But I am not a fan of creating a new enumerable each time an enumerator is needed. My understanding is that a single IEnumerable<T> can create many IEnumerator<T>s. AFAIK these two interfaces are not meant to have a one-to-one relationship.

Comments (4)

时光无声 2025-02-11 07:00:17

The problem is caused by the auto-generated implementation for the code using yield.

You can mitigate this somewhat by explicitly implementing the enumerator.

You have to fiddle it a bit by calling .Reset() from public IEnumerator<string> GetEnumerator() to ensure the enumeration restarts at each call:

class DnaGenerator : IEnumerable<string>
{
    private readonly IEnumerator<string> _enumerable;

    public DnaGenerator() => _enumerable = new IteratorImpl();

    sealed class IteratorImpl : IEnumerator<string>
    {
        readonly char[] _data = { 'A', 'C', 'G', 'T' };

        // -1 means "positioned before the first element".
        int _index = -1;

        public bool MoveNext()
        {
            // Advance cyclically through the four nucleotides.
            _index = (_index + 1) % _data.Length;
            return true; // Infinite sequence.
        }

        public void Reset()
        {
            _index = -1;
        }

        // The string is created on demand and never stored in a field,
        // so the enumerator itself keeps no sequence alive.
        public string Current => new String(_data[_index], 10_000_000);

        object IEnumerator.Current => Current;

        public void Dispose()
        {
            // Nothing to do.
        }
    }

    public IEnumerator<string> GetEnumerator()
    {
        // The single enumerator instance is shared, so nested or concurrent
        // enumerations of the same DnaGenerator would interfere with each other.
        _enumerable.Reset();
        return _enumerable;
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
忆依然 2025-02-11 07:00:17

Note that 10_000_000 chars (which are 16 bits each) take approximately 20 MB. If you look at the decompilation, you will notice that yield return results in a compiler-generated class <Iterator>d__2, which in turn has a field that stores the current string (to implement IEnumerator<string>.Current):

[CompilerGenerated]
private sealed class <Iterator>d__2 : IEnumerable<string>, IEnumerable, IEnumerator<string>, IEnumerator, IDisposable
{
    ...
    private string <>2__current;
    ...
}

And Iterator method internally will be compiled to something like this:

[IteratorStateMachine(typeof(<Iterator>d__2))]
private IEnumerable<string> Iterator()
{
    return new <Iterator>d__2(-2);
}

This means that, once iteration has started, the object behind _enumerable keeps the most recently yielded string in memory for as long as the DnaGenerator instance itself is not garbage collected.

UPD

My understanding is that a single IEnumerable can create many IEnumerators. AFAIK these two interfaces are not meant to have a one-to-one relationship.

Yes, in general an enumerable generated for yield return can create multiple enumerators, but in this particular case the implementation has a "one-to-one" relationship, because the generated class is both IEnumerable and IEnumerator:

private sealed class <Iterator>d__2 : 
    IEnumerable<string>, IEnumerable,
    IEnumerator<string>, IEnumerator, 
    IDisposable

But I am not a fan of creating a new enumerable each time an enumerator is needed.

But that is actually what happens when you call _enumerable.GetEnumerator() (which is obviously an implementation detail). If you check the decompilation mentioned above, you will see that _enumerable = Iterator() is actually new <Iterator>d__2(-2), and <Iterator>d__2.GetEnumerator() looks something like this:

IEnumerator<string> IEnumerable<string>.GetEnumerator()
{
    if (<>1__state == -2 && <>l__initialThreadId == Environment.CurrentManagedThreadId)
    {
        <>1__state = 0;
        return this;
    }
    return new <Iterator>d__2(0);
}

So it should actually create a new iterator instance for every enumeration except the first one, which means your public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator(); approach is just fine.
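
The "dual personality" of the generated class can be observed directly with a reference comparison. The following is a minimal sketch, not part of the original answer: IteratorIdentityDemo, Letters and Run are hypothetical names, and the behavior shown depends on how the current C# compiler implements iterators, which is an implementation detail rather than a documented guarantee.

```csharp
using System;
using System.Collections.Generic;

static class IteratorIdentityDemo
{
    // Hypothetical stand-in for the Iterator() method from the question,
    // yielding short strings so the demo is cheap to run.
    static IEnumerable<string> Letters()
    {
        while (true)
            foreach (char c in new[] { 'A', 'C', 'G', 'T' })
                yield return c.ToString();
    }

    // Returns whether the first and the second enumerator obtained from one
    // enumerable are the very same object as the enumerable itself.
    public static (bool First, bool Second) Run()
    {
        IEnumerable<string> enumerable = Letters();
        using IEnumerator<string> e1 = enumerable.GetEnumerator();
        using IEnumerator<string> e2 = enumerable.GetEnumerator();
        return (ReferenceEquals(enumerable, e1), ReferenceEquals(enumerable, e2));
    }
}
```

When both calls happen on the thread that created the enumerable, the first enumerator is expected to be the enumerable object itself and the second a separate instance. That is why a cached enumerable also ends up holding the state of its first enumeration, including the last yielded string.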

怎会甘心 2025-02-11 07:00:17

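If memory usage (or speed) is a concern, you might (also) want to pack several nucleotides into each byte (or int): with a four-letter alphabet each nucleotide needs only 2 bits, so one byte can represent 4 nucleotides at once. Given what you shared with us, that might be the case.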
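
As a rough illustration of that idea, here is a minimal sketch that stores each nucleotide in 2 bits. The DnaPacking class, its method names and the A=0, C=1, G=2, T=3 encoding are assumptions made for this example; they do not come from the question.

```csharp
using System;

static class DnaPacking
{
    // Assumed encoding: A=0, C=1, G=2, T=3 (2 bits per nucleotide).
    static int Code(char nucleotide) => nucleotide switch
    {
        'A' => 0,
        'C' => 1,
        'G' => 2,
        'T' => 3,
        _ => throw new ArgumentException($"Unexpected nucleotide '{nucleotide}'."),
    };

    // Packs four nucleotides into each byte, lowest bits first.
    public static byte[] Pack(string dna)
    {
        var packed = new byte[(dna.Length + 3) / 4];
        for (int i = 0; i < dna.Length; i++)
            packed[i / 4] |= (byte)(Code(dna[i]) << ((i % 4) * 2));
        return packed;
    }

    // Reads back the nucleotide at the given position of a packed sequence.
    public static char NucleotideAt(byte[] packed, int index)
        => "ACGT"[(packed[index / 4] >> ((index % 4) * 2)) & 0b11];
}
```

With this layout a sequence of 10,000,000 nucleotides occupies about 2.5 MB instead of the roughly 20 MB needed by a UTF-16 string, at the cost of a little bit manipulation on every access.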

白况 2025-02-11 07:00:17

@GuruStron's answer demonstrated that the problem that I've presented here was created by my shallow understanding of the C# iterators, and of how they are implemented internally. By storing an IEnumerable<string> in my DnaGenerator instances, I am gaining essentially nothing. When an enumerator is requested, both lines below result in allocating a single object. It's an autogenerated object with dual personality. It is both an IEnumerable<string>, and an IEnumerator<string>.

public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();

public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();

By storing the _enumerable in a field I am just preventing this object from getting recycled.
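
For reference, the variant without the cached field could look like the sketch below. It is the class from the question with the _enumerable field removed; the sequenceLength constructor parameter is a hypothetical addition that only exists so the class can be tried out with short strings.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq; // Only needed by callers that use Take, First, etc.

class DnaGenerator : IEnumerable<string>
{
    // Hypothetical parameter; the question hard-codes 10_000_000.
    private readonly int _sequenceLength;

    public DnaGenerator(int sequenceLength = 10_000_000)
        => _sequenceLength = sequenceLength;

    private IEnumerable<string> Iterator()
    {
        while (true)
            foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
                yield return new String(c, _sequenceLength);
    }

    // A fresh state machine is created per enumeration and nothing caches it,
    // so it becomes unreachable, together with the last string it yielded,
    // once the caller drops the enumerator.
    public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

Each foreach over a DnaGenerator then starts again from the first sequence, and the generator instance itself holds no reference to any sequence it has produced.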

Nevertheless, I am still searching for ways to solve this non-issue in a way that would allow me to keep the cached _enumerable field without causing a memory leak, and without resorting to implementing a full-fledged IEnumerable<string> from scratch as shown in @MatthewWatson's answer. The workaround that I found is to wrap my generated DNA sequences in StrongBox<string> wrappers:

private IEnumerable<StrongBox<string>> Iterator()
{
    while (true)
        foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
            yield return new(new String(c, 10_000_000));
}

Then I have to Unwrap the iterator before exposing it to the external world:

private readonly IEnumerable<string> _enumerable;

public DnaGenerator() => _enumerable = Iterator().Unwrap();

Here is the Unwrap extension method:

/// <summary>
/// Unwraps an enumerable sequence that contains values wrapped in StrongBox instances.
/// The latest StrongBox instance is emptied when the enumerator is disposed.
/// </summary>
public static IEnumerable<T> Unwrap<T>(this IEnumerable<StrongBox<T>> source)
    => new StrongBoxUnwrapper<T>(source);

private class StrongBoxUnwrapper<T> : IEnumerable<T>
{
    private readonly IEnumerable<StrongBox<T>> _source;
    public StrongBoxUnwrapper(IEnumerable<StrongBox<T>> source)
    {
        ArgumentNullException.ThrowIfNull(source);
        _source = source;
    }
    public IEnumerator<T> GetEnumerator() => new Enumerator(_source.GetEnumerator());
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    private class Enumerator : IEnumerator<T>
    {
        private readonly IEnumerator<StrongBox<T>> _source;
        private StrongBox<T> _latest;
        public Enumerator(IEnumerator<StrongBox<T>> source)
        {
            ArgumentNullException.ThrowIfNull(source);
            _source = source;
        }
        public T Current => _source.Current.Value;
        object IEnumerator.Current => Current;
        public bool MoveNext()
        {
            var moved = _source.MoveNext();
            _latest = _source.Current;
            return moved;
        }
        public void Dispose()
        {
            _source.Dispose();
            if (_latest is not null) _latest.Value = default;
        }
        public void Reset() => _source.Reset();
    }
}

The trick is to keep track of the latest StrongBox<T> that has been emitted by the enumerator, and set its Value to default when the enumerator is disposed.

Live demo.
