RegEx, StringBuilder and Large Object Heap fragmentation

Posted on 2024-12-13 20:28:07

How can I run a large number of regular expressions (to find matches) against big strings without causing LOH fragmentation?

I'm on .NET Framework 4.0 and I use a StringBuilder, so the text itself stays out of the LOH. However, as soon as I need to run a RegEx on it I have to call StringBuilder.ToString(), which means the resulting string ends up in the LOH.

Is there any solution to this problem? It's virtually impossible to have a long-running application that deals with big strings and RegExes like this.

An idea to solve this problem:

While thinking about this problem, I think I found a dirty solution.

At any given time I only have 5 strings, and these 5 strings (each bigger than 85 KB) will be passed to RegEx.Match.

Since the fragmentation occurs because new objects don't fit into the free gaps in the LOH, this should solve the problem:

  1. PadRight all strings to a maximum accepted size, let's say 1024 KB (I might need to do this with a StringBuilder).
  2. By doing so, every new string will fit into memory that has already been freed, because the previous string is already out of scope.
  3. There won't be any fragmentation because the object size is always the same, so I'll only ever allocate 5 × 1024 KB at a given time, and this space in the LOH will be shared between these strings.

I suppose the biggest problem with this design is what happens if other big objects get allocated in those slots in the LOH. That would cause the application to allocate lots of new 1024 KB strings, maybe with even worse fragmentation. The fixed statement might help, but how can I pass a fixed string to RegEx without actually creating a new string that is not located at a fixed memory address?

Any ideas about this theory? (Unfortunately I can't reproduce the problem easily. I'm generally trying to use a memory profiler to observe the changes, and I'm not sure what kind of isolated test case I can write for this.)
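One possible starting point for an isolated test case, offered as a rough sketch rather than a definitive method: the runtime reports LOH objects as the highest generation immediately after allocation, so GC.GetGeneration can tell whether a freshly created string landed there. The LohProbe class and its method name are made up for this illustration.

```csharp
using System;

public static class LohProbe
{
    // Hypothetical helper, not part of the question's code.
    // Objects of 85,000 bytes or more are allocated on the LOH, and the GC
    // reports them as belonging to the highest generation right away.
    // Call this immediately after allocating: a small object that survives
    // enough collections is eventually promoted to the same generation.
    public static bool IsOnLargeObjectHeap(object obj)
    {
        return GC.GetGeneration(obj) == GC.MaxGeneration;
    }
}
```

Calling LohProbe.IsOnLargeObjectHeap on the result of StringBuilder.ToString() right after the call would show whether that particular string crossed the threshold. It says nothing about fragmentation by itself, so a memory profiler is still needed for that part.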

Comments (3)

余厌 2024-12-20 20:28:07

OK, here is my attempt to solve this problem in a fairly generic way, but with some obvious limitations. Since I haven't seen this advice anywhere and everyone is whining about LOH fragmentation, I wanted to share the code to confirm that my design and assumptions are correct.

Theory:

  1. Create a shared massive StringBuilder (this is to store the big strings that we read from streams) - new StringBuilder(ChunkSize * 5);
  2. Create a massive String (it has to be bigger than the max. accepted size), initialized with spaces - new string(' ', ChunkSize * 10);
  3. Pin the string object in memory so the GC will not mess with it: GCHandle.Alloc(pinnedText, GCHandleType.Pinned). Even though LOH objects are normally not moved by the GC anyway, this seems to improve performance, maybe because of the unsafe code.
  4. Read the stream into the shared StringBuilder and then unsafe-copy it into pinnedText by using indexers.
  5. Pass pinnedText to RegEx.

With this implementation the code below works as if there were no LOH allocations. If I switch to new string(' ') allocations instead of using a static StringBuilder, or use StringBuilder.ToString(), the code can allocate 300% less memory before crashing with an OutOfMemoryException.

I also confirmed with a memory profiler that there is no LOH fragmentation in this implementation. I still don't understand why RegEx doesn't cause any unexpected problems. I also tested with different and expensive RegEx patterns and the results are the same: no fragmentation.
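One detail worth sketching, under the assumption that the reused buffer is longer than the text copied into it: characters beyond the copied region still hold whatever an earlier iteration wrote, and writing a '\0' does not change the string's Length. Regex.Match has an overload that takes a start index and a length, which confines the search to the valid region. BufferMatcher and validLength are hypothetical names for this illustration, not part of the pastebin code.

```csharp
using System.Text.RegularExpressions;

public static class BufferMatcher
{
    // Hypothetical helper: searches only the first validLength characters
    // of a reusable buffer, so stale data further along is never examined.
    // Match.Index in the result is still relative to the whole buffer.
    public static Match MatchValidRegion(Regex regex, string buffer, int validLength)
    {
        return regex.Match(buffer, 0, validLength);
    }
}
```

In the loop of the program, validLength would presumably be the current length of the shared StringBuilder. Anchors such as ^ and $ then apply to the restricted region rather than to the full buffer.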

Code:

http://pastebin.com/ZuuBUXk3

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;
using System.Text.RegularExpressions;

namespace LOH_RegEx
{
    internal class Program
    {
        private static List<string> storage = new List<string>();
        private const int ChunkSize = 100000;
        private static StringBuilder _sb = new StringBuilder(ChunkSize * 5);


        private static void Main(string[] args)
        {
            var pinnedText = new string(' ', ChunkSize * 10);
            var sourceCodePin = GCHandle.Alloc(pinnedText, GCHandleType.Pinned);

            var rgx = new Regex("A", RegexOptions.CultureInvariant | RegexOptions.Compiled);

            try
            {

                for (var i = 0; i < 30000; i++)
                {                   
                    //Simulate that we read data from stream to SB
                    UpdateSB(i);
                    CopyInto(pinnedText);                   
                    var rgxMatch = rgx.Match(pinnedText);

                    if (!rgxMatch.Success)
                    {
                        Console.WriteLine("RegEx failed!");
                        Console.ReadLine();
                    }

                    //Extra buffer to fragment LoH
                    storage.Add(new string('z', 50000));
                    if ((i%100) == 0)
                    {
                        Console.Write(i + ",");
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                Console.WriteLine("OOM Crash!");
                Console.ReadLine();
            }
            finally
            {
                // Release the pin so the GC handle does not leak.
                sourceCodePin.Free();
            }
        }


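        // Overwrites the pinned string in place. Mutating a string is not
        // supported by the runtime; this relies on 'text' never being
        // interned or shared with other code.
        private static unsafe void CopyInto(string text)
        {
            if (_sb.Length >= text.Length)
            {
                throw new InvalidOperationException("Buffer too small.");
            }

            fixed (char* pChar = text)
            {
                int i;
                for (i = 0; i < _sb.Length; i++)
                {
                    pChar[i] = _sb[i];
                }

                // i == _sb.Length here, so this terminates directly after the
                // last copied character. text.Length itself is unchanged.
                pChar[i] = '\0';
            }
        }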

        private static void UpdateSB(int extraSize)
        {
            _sb.Remove(0,_sb.Length);

            var rnd = new Random();
            for (var i = 0; i < ChunkSize + extraSize; i++)
            {
                _sb.Append((char)rnd.Next(60, 80));
            }
        }
    }
}
泛滥成性 2024-12-20 20:28:07

Can you do your work in an AppDomain that is unloaded from time to time?

何以畏孤独 2024-12-20 20:28:07

One alternative would be to find some way of performing regex matches on a non-array-based data structure. Unfortunately, a quick Google search didn't bring up much in terms of stream-based regex libraries. I would guess that the regex algorithm needs to do a lot of backtracking, which isn't supported by streams.

Do you absolutely require the full power of regular expressions? Could you perhaps implement your own simpler search functions that could work on linked lists of strings all under 85kb?
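As a minimal sketch of what such a simpler search might look like, assuming the text is held as a list of chunks that are each small enough to stay out of the LOH: the code below finds a plain substring across chunk boundaries without ever concatenating the chunks. ChunkedSearch and its members are hypothetical names, and the naive character-by-character comparison is chosen for brevity, not speed.

```csharp
using System.Collections.Generic;

public static class ChunkedSearch
{
    // Returns the index (within the logical concatenation of all chunks) of
    // the first ordinal occurrence of pattern, or -1 if there is none.
    public static long IndexOf(IList<string> chunks, string pattern)
    {
        if (pattern.Length == 0) return 0;

        long chunkStart = 0; // global index of the first char of chunks[c]
        for (int c = 0; c < chunks.Count; c++)
        {
            for (int o = 0; o < chunks[c].Length; o++)
            {
                if (MatchesAt(chunks, c, o, pattern)) return chunkStart + o;
            }
            chunkStart += chunks[c].Length;
        }
        return -1;
    }

    // Compares pattern against the text starting at (chunk, offset),
    // stepping into following chunks when the current one is exhausted.
    private static bool MatchesAt(IList<string> chunks, int chunk, int offset, string pattern)
    {
        for (int p = 0; p < pattern.Length; p++)
        {
            while (chunk < chunks.Count && offset >= chunks[chunk].Length)
            {
                chunk++;
                offset = 0;
            }
            if (chunk >= chunks.Count) return false;
            if (chunks[chunk][offset] != pattern[p]) return false;
            offset++;
        }
        return true;
    }
}
```

For example, searching the chunks "hello wo" and "rld" for "world" finds the match that straddles the boundary. Anything beyond literal substrings (alternation, quantifiers, captures) would need considerably more work, which is the trade-off behind the question above.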

Also, LOH fragmentation only really causes issues if you hold on to the large object references for long periods. If you're constantly creating and destroying them, the LOH shouldn't grow.

FWIW, I've found the RedGate ANTS memory profiler very good at tracking down objects in the LOH and levels of fragmentation.
