Using ThreadStatic to replace expensive local variables: a good idea?

Posted on 2024-10-15 02:07:50

Update: as I should have expected, the community's sound advice in response to this question was to "measure it and see." chibacity posted an answer with some really nice tests that did this for me; meanwhile, I wrote a test of my own; and the performance difference I saw was actually so huge that I felt compelled to write a blog post about it.

However, I should also acknowledge Hans's explanation that the ThreadStatic attribute is indeed not free and in fact relies on a CLR helper method to work its magic. This makes it far from obvious whether it would be an appropriate optimization to apply in any arbitrary case.

The good news for me is that, in my case, it seems to have made a big improvement.


I have a method which (among many other things) instantiates some medium-size arrays (~50 elements) for a few local variables.

After some profiling I've identified this method as something of a performance bottleneck. It isn't that the method takes an extremely long time per call; rather, it is simply called many times in quick succession (hundreds of thousands to millions of times in a session, and a session lasts several hours). So even relatively small improvements to its performance should be worthwhile.

It occurred to me that instead of allocating a new array on each call, I could use fields marked [ThreadStatic]; whenever the method is called, it checks whether the field has been initialized on the current thread, and if not, initializes it. From that point on, every call on the same thread has an array ready to go.

(The method initializes every element in the array itself, so having "stale" elements in the array should not be an issue.)
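To make the idea concrete, here is a minimal sketch of the pattern described above. The class and member names (Scratch, _buffer, SumOfSquares) are made up for illustration and are not from my real code; the only assumptions are that the array is private to the class and that every element is written before it is read.

```csharp
using System;

public static class Scratch
{
    private const int Size = 50;

    // One array per thread. The field is null on each thread until that
    // thread first uses it. A field initializer would not help here: it
    // runs only once, on the thread that triggers the static constructor,
    // so every other thread would still see null.
    [ThreadStatic]
    private static int[] _buffer;

    // Hypothetical workload: fill the scratch array, then consume it.
    public static int SumOfSquares(int seed)
    {
        // Lazily create the array on this thread, then work through a local.
        int[] buffer = _buffer ?? (_buffer = new int[Size]);

        // Every element is overwritten before it is read, so values left
        // over from a previous call cannot leak into this one.
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (seed + i) * (seed + i);
        }

        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer[i];
        }
        return sum;
    }
}
```

Calling Scratch.SumOfSquares repeatedly on one thread allocates the array only on the first call; a second thread would get its own array on its own first call.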

My question is simply this: does this seem like a good idea? Are there pitfalls to using the ThreadStatic attribute in this way (i.e., as a performance optimization to mitigate the cost of instantiating new objects for local variables) that I should know about? Is the performance of a ThreadStatic field itself perhaps not great; e.g., is there a lot of extra "stuff" happening in the background, with its own set of costs, to make this feature possible?

It's also quite plausible that I'm wrong to even try to optimize something as cheap (?) as a 50-element array. If that's so, definitely let me know, but the general question still holds.

Comments (3)

桃扇骨 2024-10-22 02:07:50

[ThreadStatic] is no free lunch. Every access to the variable needs to go through a helper function in the CLR (JIT_GetThreadFieldAddr_Primitive/Objref) instead of being compiled inline by the jitter. It also isn't a true substitute for a local variable: recursion is going to bite. You really have to profile this yourself; guesstimating perf with that much CLR code in the loop isn't feasible.
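The recursion problem can be shown with a small sketch. The names here (RecursionPitfall, _shared, and the two methods) are hypothetical and exist only for this illustration. When a method that uses a thread-static scratch array calls itself, the inner call writes into the same array the outer call is still using.

```csharp
using System;

public static class RecursionPitfall
{
    [ThreadStatic]
    private static int[] _shared;

    // Unsafe under recursion: every level of the call chain on this thread
    // shares one array, so inner calls overwrite the outer call's data.
    public static int ValueSeenAfterRecursionShared(int depth)
    {
        int[] scratch = _shared ?? (_shared = new int[1]);
        scratch[0] = depth;
        if (depth > 0)
        {
            ValueSeenAfterRecursionShared(depth - 1);
        }
        return scratch[0]; // whatever the innermost call wrote last
    }

    // Safe: each call owns its array, at the cost of one allocation per call.
    public static int ValueSeenAfterRecursionLocal(int depth)
    {
        int[] scratch = new int[1];
        scratch[0] = depth;
        if (depth > 0)
        {
            ValueSeenAfterRecursionLocal(depth - 1);
        }
        return scratch[0]; // still this call's own value
    }
}
```

With a starting depth of 3, the thread-static version returns 0 (the innermost call's value) while the local version returns 3. The same hazard applies to any re-entrant path, such as a callback that ends up calling the method again on the same thread.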

浅浅淡淡 2024-10-22 02:07:50

I have carried out a simple benchmark, and ThreadStatic performs better for the simple parameters described in the question.

As with many algorithms that run a high number of iterations, I suspect this is a straightforward case of GC overhead killing the version that allocates new arrays.

Update

These tests add an iteration over the array to model minimal use of the array reference. They also add a case that accesses the array through the ThreadStatic field directly, alongside the previous test where the reference was copied to a local:

Iterations : 10,000,000

Local ArrayRef          (- array iteration) : 330.17ms
Local ArrayRef          (- array iteration) : 327.03ms
Local ArrayRef          (- array iteration) : 1382.86ms
Local ArrayRef          (- array iteration) : 1425.45ms
Local ArrayRef          (- array iteration) : 1434.22ms
TS    CopyArrayRefLocal (- array iteration) : 107.64ms
TS    CopyArrayRefLocal (- array iteration) : 92.17ms
TS    CopyArrayRefLocal (- array iteration) : 92.42ms
TS    CopyArrayRefLocal (- array iteration) : 92.07ms
TS    CopyArrayRefLocal (- array iteration) : 92.10ms
Local ArrayRef          (+ array iteration) : 1740.51ms
Local ArrayRef          (+ array iteration) : 1647.26ms
Local ArrayRef          (+ array iteration) : 1639.80ms
Local ArrayRef          (+ array iteration) : 1639.10ms
Local ArrayRef          (+ array iteration) : 1646.56ms
TS    CopyArrayRefLocal (+ array iteration) : 368.03ms
TS    CopyArrayRefLocal (+ array iteration) : 367.19ms
TS    CopyArrayRefLocal (+ array iteration) : 367.22ms
TS    CopyArrayRefLocal (+ array iteration) : 368.20ms
TS    CopyArrayRefLocal (+ array iteration) : 367.37ms
TS    TSArrayRef        (+ array iteration) : 360.45ms
TS    TSArrayRef        (+ array iteration) : 359.97ms
TS    TSArrayRef        (+ array iteration) : 360.48ms
TS    TSArrayRef        (+ array iteration) : 360.03ms
TS    TSArrayRef        (+ array iteration) : 359.99ms

Code:

[ThreadStatic]
private static int[] _array;

[Test]
public object measure_thread_static_performance()
{
    const int TestIterations = 5;
    const int Iterations = (10 * 1000 * 1000);
    const int ArraySize = 50;

    Action<string, Action> time = (name, test) =>
    {
        for (int i = 0; i < TestIterations; i++)
        {
            TimeSpan elapsed = TimeTest(test, Iterations);
            Console.WriteLine("{0} : {1:F2}ms", name, elapsed.TotalMilliseconds);
        }
    };

    int[] array = null;
    int j = 0;

    Action test1 = () =>
    {
        array = new int[ArraySize];
    };

    Action test2 = () =>
    {
        array = _array ?? (_array = new int[ArraySize]);
    };

    Action test3 = () =>
    {
        array = new int[ArraySize];

        for (int i = 0; i < ArraySize; i++)
        {
            j = array[i];
        }
    };

    Action test4 = () =>
    {
        array = _array ?? (_array = new int[ArraySize]);

        for (int i = 0; i < ArraySize; i++)
        {
            j = array[i];
        }
    };

    Action test5 = () =>
    {
        array = _array ?? (_array = new int[ArraySize]);

        for (int i = 0; i < ArraySize; i++)
        {
            j = _array[i];
        }
    };

    Console.WriteLine("Iterations : {0:0,0}\r\n", Iterations);
    time("Local ArrayRef          (- array iteration)", test1);
    time("TS    CopyArrayRefLocal (- array iteration)", test2);
    time("Local ArrayRef          (+ array iteration)", test3);
    time("TS    CopyArrayRefLocal (+ array iteration)", test4);
    time("TS    TSArrayRef        (+ array iteration)", test5);

    Console.WriteLine(j);

    return array;
}

[SuppressMessage("Microsoft.Reliability", "CA2001:AvoidCallingProblematicMethods", MessageId = "System.GC.Collect")]
private static TimeSpan TimeTest(Action action, int iterations)
{
    Action gc = () =>
    {
        GC.Collect();
        GC.WaitForFullGCComplete();
    };

    Action empty = () => { };

    Stopwatch stopwatch1 = Stopwatch.StartNew();

    for (int j = 0; j < iterations; j++)
    {
        empty();
    }

    TimeSpan loopElapsed = stopwatch1.Elapsed;

    gc();
    action(); //JIT
    action(); //Optimize

    Stopwatch stopwatch2 = Stopwatch.StartNew();

    for (int j = 0; j < iterations; j++) action();

    gc(); // runs before the stopwatch is read, so collection cost counts toward the measured time

    TimeSpan testElapsed = stopwatch2.Elapsed;

    return (testElapsed - loopElapsed);
}
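
The GC-overhead suspicion can also be checked without timing anything, by comparing how many bytes each approach allocates. The following is a rough sketch rather than part of the benchmark above: the class and method names are made up, and it assumes a runtime that provides GC.GetAllocatedBytesForCurrentThread (.NET Core 3.0 or later, or .NET Framework 4.8).

```csharp
using System;

public static class AllocationCheck
{
    [ThreadStatic]
    private static int[] _cached;

    // Results are stored in a static field so the arrays escape the method;
    // this is meant to stop the JIT from optimizing the allocation away
    // (an assumption about JIT behaviour, not a guarantee).
    private static int[] _sink;

    // Allocates a fresh 50-element array on every iteration.
    public static long BytesAllocatedByFreshArrays(int iterations)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            _sink = new int[50];
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    // Reuses one thread-static array; allocates at most once per thread.
    public static long BytesAllocatedByReusedArray(int iterations)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            _sink = _cached ?? (_cached = new int[50]);
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

For 100,000 iterations, the first method should report well over 20 MB allocated (each array holds 200 bytes of data plus its header), while the second should report at most the size of a single array. Every byte of that garbage eventually has to be collected, which fits the timings above.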

佼人 2024-10-22 02:07:50

From results like this, ThreadStatic looks pretty fast. I'm not sure anybody has a specific answer as to whether it's faster than reallocating a 50-element array, though. That's the kind of thing you'll have to benchmark yourself. :)

I'm somewhat torn on whether it's a "good idea" or not. So long as all the implementation details are kept inside the class, it's not necessarily a bad idea (you really don't want the caller to have to worry about it). But unless benchmarks showed a performance gain from this approach, I would stick to simply allocating the array each time, because that makes the code simpler and easier to read. Since this is the more complicated of the two solutions, I'd need to see some benefit from the complexity before choosing it.
