Why is thread-local storage so slow?

Posted 2024-07-13 00:57:54 · 343 words · 15 views · 0 comments

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?

Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.
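
For readers unfamiliar with the pattern, here is a minimal C++ sketch of a mark-release region reached through a single thread-local lookup per call. It is not the D allocator from the question: `Region`, `tls_region`, `region_alloc`, and the fixed 4096-byte buffer are all hypothetical names and sizes chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size region; a real allocator would grow or chain regions.
struct Region {
    alignas(std::max_align_t) unsigned char buffer[4096];
    std::size_t used = 0;

    // Bump allocation; returns nullptr when the request does not fit.
    void* allocate(std::size_t n) {
        if (n > sizeof(buffer)) return nullptr;
        const std::size_t a = alignof(std::max_align_t);
        const std::size_t rounded = (n + a - 1) / a * a;  // keep blocks aligned
        if (rounded > sizeof(buffer) - used) return nullptr;
        void* p = buffer + used;
        used += rounded;
        return p;
    }

    std::size_t mark() const { return used; }   // remember the current top
    void release(std::size_t m) { used = m; }   // drop everything allocated since mark()
};

// One region per thread.
thread_local Region tls_region;

void* region_alloc(std::size_t n) {
    Region& r = tls_region;  // the single TLS lookup for this call
    return r.allocate(n);
}
```

Everything after the first line of `region_alloc` works on an ordinary reference, so whatever the TLS lookup costs is paid once per call rather than once per field access.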

Comments (6)

蹲在坟头点根烟 2024-07-20 00:57:54

The speed depends on the TLS implementation.

Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.

The pointer lookup needs help from the scheduler, though: on every task switch, the scheduler must update the pointer to the TLS data.

Another fast way to implement TLS is via the memory management unit. Here TLS is treated like any other data, except that TLS variables are allocated in a special segment. On every task switch, the scheduler maps the correct chunk of memory into the address space of the task.

If the scheduler does not support either of these methods, the compiler/library has to do the following:

  • Get the current ThreadId
  • Take a semaphore
  • Look up the pointer to the TLS block by the ThreadId (for example, in a map)
  • Release the semaphore
  • Return that pointer.

Obviously, doing all this for each TLS data access takes a while and may need up to three OS calls: getting the ThreadId, and taking and releasing the semaphore.

The semaphore is required, by the way, to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and therefore allocating a new TLS block and modifying the data structure).
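
The slow path listed above can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: `TlsBlock`, `g_tls_table`, and `slow_tls_get` are hypothetical names, a `std::mutex` stands in for the semaphore, and blocks are created lazily on first access rather than when a thread is spawned.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical per-thread block; the field is illustrative only.
struct TlsBlock {
    int counter = 0;
};

// Global table mapping thread ids to their blocks, guarded by one lock.
static std::mutex g_tls_lock;
static std::map<std::thread::id, std::unique_ptr<TlsBlock>> g_tls_table;

// Slow-path lookup: every access pays for the id query, the lock, and the search.
TlsBlock* slow_tls_get() {
    const std::thread::id self = std::this_thread::get_id();  // get the current thread id
    std::lock_guard<std::mutex> guard(g_tls_lock);            // take the lock
    std::unique_ptr<TlsBlock>& slot = g_tls_table[self];      // look up the block by id
    if (!slot) {
        slot.reset(new TlsBlock());                           // first access from this thread
    }
    return slot.get();  // the guard releases the lock as the function returns
}
```

Compared with a single segment-relative load, each call here performs a lock acquisition and an ordered-map search, which is where the extra cost described above comes from.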

Unfortunately it's not uncommon to see the slow TLS implementation in practice.

○愚か者の日 2024-07-20 00:57:54

Thread locals in D are really fast. Here are my tests.

64-bit Ubuntu, Core i5, dmd v2.052
Compiler options: dmd -O -release -inline -m64

// this loop takes 0m0.630s
void main(){
    int a; // register allocated
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}

// this loop takes 0m1.875s
int a; // thread local in D, not static
void main(){
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}

So we lose only about 1.2 seconds on one CPU core per 1000*1000*1000 thread-local accesses.
Thread locals are accessed through the %fs register, so only a couple of processor instructions are involved:

Disassembling with objdump -d:

- this is the local variable in the %ecx register (loop counter in %eax):
   8:   31 c9                   xor    %ecx,%ecx
   a:   b8 00 ca 9a 3b          mov    $0x3b9aca00,%eax
   f:   83 c1 09                add    $0x9,%ecx
  12:   ff c8                   dec    %eax
  14:   85 c0                   test   %eax,%eax
  16:   75 f7                   jne    f <_Dmain+0xf>

- this is the thread local; the %fs register is used for indirection, and %edx is the loop counter:
   6:   ba 00 ca 9a 3b          mov    $0x3b9aca00,%edx
   b:   64 48 8b 04 25 00 00    mov    %fs:0x0,%rax
  12:   00 00 
  14:   48 8b 0d 00 00 00 00    mov    0x0(%rip),%rcx        # 1b <_Dmain+0x1b>
  1b:   83 04 08 09             addl   $0x9,(%rax,%rcx,1)
  1f:   ff ca                   dec    %edx
  21:   85 d2                   test   %edx,%edx
  23:   75 e6                   jne    b <_Dmain+0xb>

Maybe the compiler could be even more clever and cache the thread local in a register before the loop,
then write it back to the thread local at the end (it would be interesting to compare with the gdc compiler),
but even now things are very good, IMHO.

陌生 2024-07-20 00:57:54

One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.

To see what kind of code is generated for TLS, compile this code and run obj2asm on the result:

__thread int x;
int foo() { return x; }

TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.

I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.
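
As a rough illustration of the caching advice, here is a C++ sketch (not D, and not the pool code mentioned above) that contrasts touching a thread-local on every iteration with reading it once into a temporary. The names `tls_total`, `add_direct`, and `add_cached` are made up for this example, and an optimizing compiler may perform the same transformation on its own.

```cpp
#include <cassert>
#include <cstdint>

// Thread-local counter, analogous to a module-level `int a;` in D.
thread_local std::int64_t tls_total = 0;

// Touches the thread-local on every iteration.
void add_direct(int n) {
    for (int i = 0; i < n; ++i) {
        tls_total += 9;
    }
}

// Reads the thread-local once, works on a local copy, and writes it back once.
void add_cached(int n) {
    std::int64_t local = tls_total;  // one TLS read
    for (int i = 0; i < n; ++i) {
        local += 9;                  // plain register or stack arithmetic
    }
    tls_total = local;               // one TLS write
}
```

The same idea applies to an allocator: take a pointer or reference to the thread-local pool once at the top of a function and use that for the rest of the call.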

人│生佛魔见 2024-07-20 00:57:54

I've designed multi-taskers for embedded systems, and conceptually the key requirement for thread-local storage is having the context switch method save/restore a pointer to thread-local storage along with the CPU registers and whatever else it's saving/restoring. For embedded systems which will always be running the same set of code once they've started up, it's easiest to simply save/restore one pointer, which points to a fixed-format block for each thread. Nice, clean, easy, and efficient.

Such an approach works well if one doesn't mind having space for every thread-local variable allocated within every thread--even those that never actually use it--and if everything that's going to be within the thread-local storage block can be defined as a single struct. In that scenario, accesses to thread-local variables can be almost as fast as access to other variables, the only difference being an extra pointer dereference. Unfortunately, many PC applications require something more complicated.
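
A minimal C++ sketch of the fixed-format approach might look like the following. It assumes a hypothetical multitasker in which `switch_to` stands in for the TLS-related part of the context switch; `ThreadLocals`, `Task`, and `g_current_locals` are illustrative names, not part of any real scheduler.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-format block: every thread-local variable is a field.
struct ThreadLocals {
    int error_code = 0;
    std::size_t bytes_allocated = 0;
};

// Hypothetical task control block owned by the scheduler.
struct Task {
    ThreadLocals locals;
    // A real multitasker would also keep the saved CPU registers here.
};

// The one pointer that the context switch saves and restores.
static ThreadLocals* g_current_locals = nullptr;

// Stand-in for the part of a context switch that concerns thread-local storage.
void switch_to(Task& next) {
    g_current_locals = &next.locals;
}

// An access costs one extra dereference compared with a plain global.
void record_allocation(std::size_t n) {
    g_current_locals->bytes_allocated += n;
}
```

Because the layout is one struct known at compile time, each access is a pointer load plus a fixed offset, with no lookup of any kind.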

On some frameworks for the PC, a thread will only have space allocated for thread-static variables if a module that uses those variables has been run on that thread. While this can sometimes be advantageous, it means that different threads will often have their local storage laid out differently. Consequently, it may be necessary for the threads to have some sort of searchable index of where their variables are located, and to direct all accesses to those variables through that index.

I would expect that if the framework allocates a small amount of fixed-format storage, it may be helpful to keep a cache of the last 1-3 thread-local variables accessed, since in many scenarios even a single-item cache could offer a pretty high hit rate.
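
The single-item cache could be sketched like this in C++, assuming a per-thread directory keyed by a small integer. `ThreadVarIndex` and its members are hypothetical, and the hit counter exists only to make the cache's effect observable.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical per-thread directory mapping a variable key to its storage slot.
class ThreadVarIndex {
public:
    // Returns the slot for `key`, creating a zero-initialized one on first use.
    int* lookup(int key) {
        if (cached_slot_ != nullptr && cached_key_ == key) {
            ++hits_;
            return cached_slot_;   // fast path: same variable as the last access
        }
        int* slot = &slots_[key];  // slow path: search the index
        cached_key_ = key;
        cached_slot_ = slot;
        return slot;
    }

    int hits() const { return hits_; }

private:
    std::unordered_map<int, int> slots_;  // element addresses stay valid across rehashing
    int cached_key_ = 0;
    int* cached_slot_ = nullptr;
    int hits_ = 0;
};
```

Code that repeatedly touches the same thread-local variable skips the hash lookup after the first access, which is the high-hit-rate case described above.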

橘亓 2024-07-20 00:57:54

If you can't use compiler TLS support, you can manage TLS yourself.
I built a wrapper template for C++, so it is easy to replace the underlying implementation.
In this example, I've implemented it for Win32.
Note: Since you cannot obtain an unlimited number of TLS indices per process (at least under Win32),
you should point to heap blocks large enough to hold all thread-specific data.
This way you need a minimum number of TLS indices and related queries.
In the "best case", you'd have just one TLS pointer pointing to one private heap block per thread.

In a nutshell: Don't point to single objects; instead, point to thread-specific heap memory or containers holding object pointers to achieve better performance.

Don't forget to free the memory once it is no longer needed.
I do this by wrapping a thread in a class (like Java does) and handling TLS in the constructor and destructor.
Furthermore, I store frequently used data such as thread handles and IDs as class members.

usage:

for type*:
tl_ptr<type>

for const type*:
tl_ptr<const type>

for type* const:
const tl_ptr<type>

for const type* const:
const tl_ptr<const type>

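#include <windows.h>  // TlsAlloc, TlsSetValue, TlsGetValue, TlsFree
#include <cassert>

// Thread-local pointer: one Win32 TLS index shared by all threads,
// with a separate T* value stored per thread under that index.
template<typename T>
class tl_ptr {
protected:
    DWORD index;
public:
    tl_ptr(void) : index(TlsAlloc()){
        assert(index != TLS_OUT_OF_INDEXES);
        set(NULL);
    }
    // Copying would make two objects free the same index, so forbid it.
    tl_ptr(const tl_ptr&) = delete;
    // Store the pointer for the calling thread only.
    void set(T* ptr){
        TlsSetValue(index,(LPVOID) ptr);
    }
    // Fetch the calling thread's pointer (NULL if never set on this thread).
    T* get(void)const {
        return (T*) TlsGetValue(index);
    }
    tl_ptr& operator=(T* ptr){
        set(ptr);
        return *this;
    }
    // Copies the calling thread's value from other; the index is not shared.
    tl_ptr& operator=(const tl_ptr& other){
        set(other.get());
        return *this;
    }
    T& operator*(void)const{
        return *get();
    }
    T* operator->(void)const{
        return get();
    }
    // Releases the index only; the pointed-to objects must be freed by their owners.
    ~tl_ptr(){
        TlsFree(index);
    }
};
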
情何以堪。 2024-07-20 00:57:54

We have seen similar performance issues from TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel". After some effort I decided to try to improve on this.

I'm pleased to say that we now have a small API that cuts CPU time for an equivalent operation by more than 50% when the calling thread doesn't "know" its thread ID, and by more than 65% if the calling thread has already obtained its thread ID (perhaps for some earlier processing step).

The new function (get_thread_private_ptr()) always returns a pointer to a struct we use internally to hold all sorts of data, so we only need one pointer per thread.

All in all I think the Win32 TLS support is poorly crafted really.
