The cost of thread_local
Now that C++ is adding thread_local storage as a language feature, I'm wondering a few things:
- What is the cost of thread_local likely to be?
  - In memory?
  - For read and write operations?
- Associated with that: how do operating systems usually implement this? It would seem that anything declared thread_local would have to be given thread-specific storage space for each thread created.
Comments (2)
Storage space: size of the variable * number of threads, or possibly (sizeof(var) + sizeof(var*)) * number of threads.
There are two basic ways of implementing thread-local storage:
1. Using some sort of system call that gets information about the current kernel thread. Sloooow.
2. Using some pointer, probably in a processor register, that the kernel sets properly at every thread context switch, at the same time as all the other registers. Cheap.
On Intel platforms, variant 2 is usually implemented via a segment register (FS or GS, depending on the OS and on 32-bit versus 64-bit mode). Both GCC and MSVC support this. Access times are therefore about as fast as for global variables.
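To make the storage estimate concrete, here is a minimal sketch showing that every thread works on its own copy of a thread_local variable. The names tls_counter and run_workers are made up for this illustration; the memory cost in this case would be roughly sizeof(int) per live thread, plus whatever bookkeeping the implementation adds.

```cpp
#include <thread>
#include <vector>

// Hypothetical example variable: each thread gets its own zero-initialized copy.
thread_local int tls_counter = 0;

// Runs `n` worker threads; worker i increments its own tls_counter i + 1 times.
// Returns the value each worker saw in its own copy when it finished.
std::vector<int> run_workers(int n) {
    std::vector<int> seen(n, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&seen, i] {
            for (int k = 0; k <= i; ++k) ++tls_counter;
            seen[i] = tls_counter;  // each worker writes only its own slot
        });
    }
    for (auto& w : workers) w.join();
    return seen;
}
```

If the variable were an ordinary global, the workers would race on one shared int; with thread_local, worker i always ends at i + 1 and the calling thread's copy is never touched.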
It is also possible, though I haven't seen it in practice, for this to be implemented via existing library functions like pthread_getspecific. Performance would then be like 1. or 2., plus library call overhead. Keep in mind that variant 2 plus library call overhead is still a lot faster than a kernel call.
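For comparison, this is roughly what the library-function route looks like when written by hand with the POSIX key API. It is only a sketch of the pattern, not a claim about how any compiler lowers thread_local; the names slot_key, thread_slot and write_in_other_thread are invented for the example.

```cpp
#include <pthread.h>
#include <thread>

static pthread_key_t slot_key;  // hypothetical key, shared by all threads
static pthread_once_t slot_once = PTHREAD_ONCE_INIT;

// Called by the pthread library when a thread that stored a value exits.
static void free_slot(void* p) { delete static_cast<int*>(p); }

static void make_key() { pthread_key_create(&slot_key, free_slot); }

// Returns the calling thread's int, allocating it on first use.
int& thread_slot() {
    pthread_once(&slot_once, make_key);  // create the key exactly once
    void* p = pthread_getspecific(slot_key);
    if (p == nullptr) {
        p = new int(0);
        pthread_setspecific(slot_key, p);
    }
    return *static_cast<int*>(p);
}

// Writes `value` in a second thread and returns what that thread read back.
int write_in_other_thread(int value) {
    int seen = -1;
    std::thread t([&] {
        thread_slot() = value;
        seen = thread_slot();
    });
    t.join();
    return seen;
}
```

Every access pays for a function call and a null check, which is the library call overhead mentioned above.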
A description of how it works on Linux, by Uli Drepper (maintainer of glibc), can be found here: www.akkadia.org/drepper/tls.pdf
The requirement to handle dynamically loaded modules etc. makes the entire mechanism a bit convoluted, which perhaps partly explains why the document weighs in at 79 pages (!).
Memory-usage-wise, each per-thread variable obviously needs its own per-thread memory (although in some cases this can be done lazily, so that the space is allocated only when the variable is first accessed), and then there are some extra data structures needed for offset tables etc.
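At the language level, the lazy behaviour can be observed with a block-scope thread_local object, which is constructed only in threads that actually reach its declaration. The sketch below counts constructions; Scratch, touch_scratch and constructions_after are hypothetical names. Note that it demonstrates lazy initialization only: whether the underlying bytes are reserved eagerly for every thread is up to the implementation.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> constructions{0};  // hypothetical counter for this demo

struct Scratch {
    Scratch() { ++constructions; }  // record every per-thread construction
    int buffer[256] = {};
};

// The thread_local object is constructed the first time the calling
// thread executes this declaration, and never in threads that skip it.
int touch_scratch() {
    thread_local Scratch scratch;
    return ++scratch.buffer[0];
}

// Starts `total` threads, of which only the first `users` call touch_scratch().
// Returns how many Scratch objects were constructed.
int constructions_after(int total, int users) {
    constructions = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < total; ++i) {
        threads.emplace_back([i, users] {
            if (i < users) touch_scratch();
        });
    }
    for (auto& t : threads) t.join();
    return constructions;
}
```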
Performance-wise, the extra cost of accessing a TLS variable mostly revolves around retrieving the address of the variable. On x86 Linux the GS register is the starting point for locating the current thread's data; on x86-64 it is FS. Usually there are a few pointer dereferences, plus a function call (__tls_get_addr) for dynamically loaded code. Creating a new thread is also slower, because the implementation needs to allocate space for, and possibly initialize, all the TLS variables (if this is not done lazily).
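Since the cost sits in finding the address, one common way to sidestep it in a hot loop is to resolve the address once and work through a local reference. The sketch below uses invented names (tls_total, sum_direct, sum_cached). Treat it as an illustration of the idea: optimizing compilers frequently hoist the lookup on their own, so any gain has to be measured for the TLS model actually in use.

```cpp
#include <cstdint>

thread_local std::uint64_t tls_total = 0;  // hypothetical per-thread accumulator

// Names the thread_local in every iteration, so the address may be
// looked up repeatedly if the compiler cannot hoist the lookup.
std::uint64_t sum_direct(const std::uint32_t* data, int n) {
    tls_total = 0;
    for (int i = 0; i < n; ++i) tls_total += data[i];
    return tls_total;
}

// Resolves the address once, then uses an ordinary reference in the loop.
std::uint64_t sum_cached(const std::uint32_t* data, int n) {
    std::uint64_t& total = tls_total;  // single TLS address lookup
    total = 0;
    for (int i = 0; i < n; ++i) total += data[i];
    return total;
}
```

The reference is only valid on the thread that obtained it, so it should not be handed to another thread.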
TLS is nice for easily making some old thread-unsafe code patterns thread-safe (think errno), but for new code designed from the start for a multi-threaded world it's very seldom needed.
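The errno case can be checked directly. In this small sketch (errno_in_two_threads is a made-up helper), a worker writes its errno, the caller then writes a different value to its own, and the worker reads its copy back afterwards. With a single shared errno the worker would see the caller's value instead.

```cpp
#include <atomic>
#include <cerrno>
#include <thread>
#include <utility>

// Returns {value the worker read back, value the caller read back}.
std::pair<int, int> errno_in_two_threads(int worker_value, int caller_value) {
    std::atomic<bool> worker_wrote{false};
    std::atomic<bool> caller_wrote{false};
    int worker_seen = -1;

    std::thread worker([&] {
        errno = worker_value;  // write this thread's errno first
        worker_wrote = true;
        while (!caller_wrote) {
            // spin without library calls, so nothing else touches errno
        }
        worker_seen = errno;   // read after the caller has written its own
    });

    while (!worker_wrote) {
        // wait until the worker has stored its value
    }
    errno = caller_value;      // write the calling thread's errno
    int caller_seen = errno;
    caller_wrote = true;
    worker.join();
    return {worker_seen, caller_seen};
}
```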