Why is thread local storage so slow?

Posted 2024-07-13 00:57:54

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. The thread-local storage bottleneck seems to be causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single-threaded version of the code, even after designing my code to perform only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out whether it's an artifact of my benchmarking method. My understanding is that thread-local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?

Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.
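
For concreteness, the allocation pattern looks roughly like the following C++ sketch (my real code is in D; the names, sizes, and alignment here are illustrative only):

#include <cstddef>
#include <cstdlib>

// A bump-pointer region: mark() records the current position and
// release() rewinds to it, freeing everything allocated since.
struct Region {
    char*       base = (char*) std::malloc(1 << 20); // arena (no growth/overflow handling here)
    std::size_t used = 0;

    void* allocate(std::size_t n) {
        void* p = base + used;
        used += (n + 15) & ~std::size_t(15);         // keep 16-byte alignment
        return p;
    }
    std::size_t mark() const    { return used; }
    void release(std::size_t m) { used = m; }
};

thread_local Region region;     // the per-thread arena

void* tl_alloc(std::size_t n) {
    return region.allocate(n);  // exactly one thread-local access per allocation
}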

Comments (6)

蹲在坟头点根烟 2024-07-20 00:57:54

The speed depends on the TLS implementation.

Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.

For the pointer lookup you need help from the scheduler though. The scheduler must - on a task switch - update the pointer to the TLS data.

Another fast way to implement TLS is via the Memory Management Unit. Here the TLS is treated like any other data with the exception that TLS variables are allocated in a special segment. The scheduler will - on task switch - map the correct chunk of memory into the address space of the task.

If the scheduler does not support any of these methods, the compiler/library has to do the following:

  • Get the current ThreadId
  • Take a semaphore
  • Look up the pointer to the TLS block by the ThreadId (a map or similar may be used)
  • Release the semaphore
  • Return that pointer.

Obviously, doing all this for each TLS data access takes a while and may require up to three OS calls: getting the ThreadId, and taking and releasing the semaphore.

The semaphore, by the way, is required to make sure that no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and, as part of that, allocating a new TLS block and modifying the data structure).
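
Sketched in C++, that slow path looks roughly like the following (the mutex, map, and lazy creation are illustrative assumptions, not any particular runtime's implementation):

#include <mutex>
#include <thread>
#include <unordered_map>

struct TlsBlock { /* per-thread variables live here */ };

std::mutex tls_lock;                                       // the "semaphore"
std::unordered_map<std::thread::id, TlsBlock*> tls_blocks; // ThreadId -> block

TlsBlock* get_tls_block() {
    std::thread::id tid = std::this_thread::get_id();  // 1. get the current ThreadId
    std::lock_guard<std::mutex> guard(tls_lock);       // 2. take the semaphore
    TlsBlock*& block = tls_blocks[tid];                // 3. look up the block by ThreadId
    if (block == nullptr)
        block = new TlsBlock();                        //    (created lazily on first access)
    return block;                                      // 4./5. guard releases the semaphore; return the pointer
}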

Unfortunately, it's not uncommon to see slow TLS implementations like this in practice.

○愚か者の日 2024-07-20 00:57:54

Thread locals in D are really fast. Here are my tests.

64-bit Ubuntu, Core i5, dmd v2.052
Compiler options: dmd -O -release -inline -m64

// this loop takes 0m0.630s
void main(){
    int a; // register allocated
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}

// this loop takes 0m1.875s
int a; // thread local in D, not static
void main(){
    for( int i=1000*1000*1000; i>0; i-- ){
        a+=9;
    }
}

So we lose only about 1.2 seconds of one CPU core per 1000*1000*1000 thread-local accesses.
Thread locals are accessed through the %fs segment register, so only a couple of extra processor instructions are involved:

Disassembling with objdump -d:

- this is the local variable in the %ecx register (loop counter in %eax):
   8:   31 c9                   xor    %ecx,%ecx
   a:   b8 00 ca 9a 3b          mov    $0x3b9aca00,%eax
   f:   83 c1 09                add    $0x9,%ecx
  12:   ff c8                   dec    %eax
  14:   85 c0                   test   %eax,%eax
  16:   75 f7                   jne    f <_Dmain+0xf>

- this is the thread local; the %fs register is used for the indirection (%edx is the loop counter):
   6:   ba 00 ca 9a 3b          mov    $0x3b9aca00,%edx
   b:   64 48 8b 04 25 00 00    mov    %fs:0x0,%rax
  12:   00 00 
  14:   48 8b 0d 00 00 00 00    mov    0x0(%rip),%rcx        # 1b <_Dmain+0x1b>
  1b:   83 04 08 09             addl   $0x9,(%rax,%rcx,1)
  1f:   ff ca                   dec    %edx
  21:   85 d2                   test   %edx,%edx
  23:   75 e6                   jne    b <_Dmain+0xb>

Maybe the compiler could be even cleverer and cache the thread local in a register before the loop,
writing it back to the thread local at the end (it would be interesting to compare with the gdc compiler),
but even as things stand, matters are very good IMHO.

陌生 2024-07-20 00:57:54

One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.

To see what kind of code is generated for TLS, compile and run obj2asm on this code:

__thread int x;
int foo() { return x; }

TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.
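
As a sketch of that caching advice (generic C++, not D-specific; the loop is illustrative), hoisting the thread-local access into a temporary turns N TLS lookups into one read and one write-back:

thread_local long counter;

void slow(int n) {
    for (int i = 0; i < n; ++i)
        counter += 9;      // a TLS lookup on every iteration
}

void fast(int n) {
    long tmp = counter;    // one TLS read
    for (int i = 0; i < n; ++i)
        tmp += 9;          // plain register/stack access
    counter = tmp;         // one TLS write-back
}

Since the variable is thread-local, no other thread can observe it mid-loop, so the transformation is safe.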

I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.

人│生佛魔见 2024-07-20 00:57:54

I've designed multi-taskers for embedded systems, and conceptually the key requirement for thread-local storage is having the context switch method save/restore a pointer to thread-local storage along with the CPU registers and whatever else it's saving/restoring. For embedded systems which will always be running the same set of code once they've started up, it's easiest to simply save/restore one pointer, which points to a fixed-format block for each thread. Nice, clean, easy, and efficient.

Such an approach works well if one doesn't mind having space for every thread-local variable allocated within every thread--even those that never actually use it--and if everything that's going to be within the thread-local storage block can be defined as a single struct. In that scenario, accesses to thread-local variables can be almost as fast as access to other variables, the only difference being an extra pointer dereference. Unfortunately, many PC applications require something more complicated.
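
In rough C++ terms, the fixed-format scheme might look like the sketch below (the struct contents and the scheduler hook are assumptions for illustration):

// All thread-local variables gathered into one fixed-format struct.
struct ThreadLocals {
    int   error_code;
    char* scratch_buffer;
};

// The scheduler saves/restores this single pointer on every context
// switch, right alongside the CPU registers.
ThreadLocals* current_tls;

// A "thread-local" access is then just one extra pointer dereference:
int get_error_code() { return current_tls->error_code; }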

On some frameworks for the PC, a thread will only have space allocated for thread-static variables if a module that uses those variables has been run on that thread. While this can sometimes be advantageous, it means that different threads will often have their local storage laid out differently. Consequently, it may be necessary for the threads to have some sort of searchable index of where their variables are located, and to direct all accesses to those variables through that index.

I would expect that if the framework allocates a small amount of fixed-format storage, it may be helpful to keep a cache of the last 1-3 thread-local variables accessed, since in many scenarios even a single-item cache could offer a pretty high hit rate.

橘亓 2024-07-20 00:57:54

If you can't use compiler TLS support, you can manage TLS yourself.
I built a wrapper template for C++, so it is easy to replace the underlying implementation.
In this example, I've implemented it for Win32.
Note: Since you cannot obtain an unlimited number of TLS indices per process (at least under Win32),
you should point to heap blocks large enough to hold all thread-specific data.
This way you keep the number of TLS indices, and the related queries, to a minimum.
In the "best case", you'd have just one TLS pointer per thread, pointing to one private heap block.

In a nutshell: don't point to single objects; instead, point to thread-specific heap memory/containers holding object pointers, to achieve better performance.

Don't forget to free the memory when it is no longer used.
I do this by wrapping a thread into a class (like Java does) and handling TLS in the constructor and destructor.
Furthermore, I store frequently used data like thread handles and IDs as class members.

Usage:

for type*:
tl_ptr<type>

for const type*:
tl_ptr<const type>

for type* const:
const tl_ptr<type>

for const type* const:
const tl_ptr<const type>

#include <windows.h>
#include <cassert>

// Wraps a single Win32 TLS index; each thread sees its own T* through it.
template<typename T>
class tl_ptr {
protected:
    DWORD index;
private:
    tl_ptr(const tl_ptr&); // not copyable: a copy would TlsFree the same index twice
public:
    tl_ptr(void) : index(TlsAlloc()){
        assert(index != TLS_OUT_OF_INDEXES); // a process has a limited number of indices
        set(NULL);
    }
    void set(T* ptr){
        TlsSetValue(index, (LPVOID) ptr);    // store this thread's pointer
    }
    T* get(void) const {
        return (T*) TlsGetValue(index);      // fetch this thread's pointer
    }
    tl_ptr& operator=(T* ptr){
        set(ptr);
        return *this;
    }
    tl_ptr& operator=(const tl_ptr& other){  // copies the other wrapper's current value
        set(other.get());
        return *this;
    }
    T& operator*(void) const {
        return *get();
    }
    T* operator->(void) const {
        return get();
    }
    ~tl_ptr(){
        TlsFree(index);                      // release the index for reuse
    }
};
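
A hypothetical usage sketch (the ThreadData struct and the setup/teardown points are assumptions, not part of the wrapper above):

struct ThreadData {
    int   id;
    char* scratch;
};

tl_ptr<ThreadData> g_data; // one TLS index, shared by all threads

void thread_body() {
    g_data = new ThreadData();  // each thread installs its own heap block once
    g_data->id = 42;            // later accesses cost one TlsGetValue each
    // ... thread work ...
    delete g_data.get();        // free the block before the thread exits
}
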
情何以堪。 2024-07-20 00:57:54

We have seen similar performance issues with TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel". After some effort I decided to try and improve on this.

I'm pleased to say that we now have a small API that offers a >50% reduction in CPU time for an equivalent operation when the calling thread doesn't "know" its thread ID, and a >65% reduction if the calling thread has already obtained its thread ID (perhaps for some other, earlier processing step).

The new function (get_thread_private_ptr()) always returns a pointer to a struct we use internally to hold all sorts of things, so we only need one pointer per thread.
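
A sketch of one plausible shape for such an API on Win32, using a single TLS slot holding a lazily created per-thread struct (the details are illustrative, not our exact implementation):

#include <windows.h>

struct ThreadPrivate { /* all per-thread state in one struct */ };

static const DWORD g_slot = TlsAlloc(); // one index for the whole process

ThreadPrivate* get_thread_private_ptr() {
    ThreadPrivate* p = (ThreadPrivate*) TlsGetValue(g_slot);
    if (p == NULL) {                    // first call on this thread
        p = new ThreadPrivate();
        TlsSetValue(g_slot, p);
    }
    return p;                           // later calls: one TlsGetValue, no locking
}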

All in all, I think the Win32 TLS support is really poorly crafted.
