boost::thread data structure sizes on the ridiculous side?
Compiler: clang++ x86-64 on Linux.

It has been a while since I have written any intricate low-level system code, and I usually program against the system primitives (Windows and pthreads/POSIX), so the ins and outs have slipped from my memory. I am working with boost::asio and boost::thread at the moment.

To emulate synchronous RPC against an asynchronous function executor (a boost::asio::io_service with multiple threads io_service::run'ing it, where requests are io_service::post'ed), I am using Boost synchronization primitives. Out of curiosity I decided to sizeof the primitives. This is what I get to see:
struct notification_object
{
bool ready;
boost::mutex m;
boost::condition_variable v;
};
...
std::cout << sizeof(bool) << std::endl;
std::cout << sizeof(boost::mutex) << std::endl;
std::cout << sizeof(boost::condition_variable) << std::endl;
std::cout << sizeof(notification_object) << std::endl;
...
Output:
1
40
88
136
Forty bytes for a mutex?! And 88 for a condition_variable! Please keep in mind that I'm repulsed by this bloated size because I am thinking of an application that could create hundreds of notification_objects.

This level of overhead for the sake of portability seems ridiculous; can someone justify it? As far as I can remember these primitives should be 4 or 8 bytes wide, depending on the memory model of the CPU.
6 Answers
When you look at the "size overhead" of any type of synchronization primitive, keep in mind that these cannot be packed too closely. That is because, e.g., two mutexes sharing a cache line would end up causing cache thrashing (false sharing) if they are in use concurrently, even if the users acquiring those locks never "conflict". I.e., imagine two threads running two loops, one repeatedly lock/unlocking lockA and the other lock/unlocking lockB, as in the sketch below.
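(A minimal sketch of the experiment just described; the std::atomic_flag spinlocks and the names spin_loop, lockA, lockB are illustrative, not the original answer's code.)

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two tiny spinlocks. Declared back to back, they will most likely land
// in the same cache line; alignas(64) on each would separate them.
static std::atomic_flag lockA = ATOMIC_FLAG_INIT;
static std::atomic_flag lockB = ATOMIC_FLAG_INIT;
static std::atomic<bool> stop(false);

static long spin_loop(std::atomic_flag& lock) {
    long iterations = 0;
    while (!stop.load(std::memory_order_relaxed)) {
        while (lock.test_and_set(std::memory_order_acquire)) {} // lock
        lock.clear(std::memory_order_release);                  // unlock
        ++iterations;
    }
    return iterations;
}

int main() {
    long a = 0, b = 0;
    std::thread t1([&] { a = spin_loop(lockA); });
    std::thread t2([&] { b = spin_loop(lockB); });
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stop = true;
    t1.join();
    t2.join();
    std::printf("lockA: %ld  lockB: %ld iterations\n", a, b);
}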
You will see twice the number of iterations when running on two different threads, compared to one thread running one loop, if and only if the two locks are not within the same cache line. If lockA and lockB are in the same cache line, the number of iterations per thread will be halved, because the cache line holding those two locks will permanently bounce between the CPU cores executing the two threads.

Hence, even though the actual data size of the primitive type underlying a spinlock or mutex might be only a byte or a 32-bit word, the effective data size of such an object is often larger.
Keep that in mind before asserting "my mutexes are too large". In fact, on x86/x64, 40 bytes is too small to prevent false sharing, since cache lines there are currently at least 64 bytes.
Beyond that, if you're highly concerned about memory usage, consider that notification objects need not be unique: condition variables can serve to trigger different events (via the predicate that boost::condition_variable knows about). It would therefore be possible to use a single mutex/CV pair for a whole state machine instead of one such pair per state (a sketch of this appears at the end of this answer). The same goes for, e.g., thread pool synchronization: having more locks than threads is not necessarily beneficial.

Edit: For a few more references on "false sharing" (and the negative performance impact caused by hosting multiple atomically-updated variables within the same cache line), see (amongst others) other SO postings on the topic.
As said, when using multiple "synchronization objects" (whether those are atomically-updated variables, locks, semaphores, ...) in a multi-core, cache-per-core configuration, allow each of them a separate cache line of space. You're trading memory usage for scalability here, but really, if you get into the region where your software needs several million locks (gigabytes of memory), you either have the funding for a few hundred GB of memory (and a hundred CPU cores), or you're doing something wrong in your software design.
In most cases (a lock / an atomic for a specific instance of a class/struct), you get the "padding" for free, as long as the object instance containing the atomic variable is large enough.
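(As promised above, a minimal sketch of the single-mutex/CV-pair-per-state-machine idea; the names StateMachine, State, set, and wait_for are illustrative.)

#include <boost/thread/condition_variable.hpp>
#include <boost/thread/locks.hpp>
#include <boost/thread/mutex.hpp>

// One mutex/CV pair serves every state transition; waiting threads are
// distinguished by their predicate, not by separate condition variables.
enum State { Idle, Running, Done };

struct StateMachine {
    boost::mutex m;
    boost::condition_variable cv;
    State state;

    StateMachine() : state(Idle) {}

    void set(State s) {
        boost::unique_lock<boost::mutex> lk(m);
        state = s;
        cv.notify_all();        // wake all waiters; each re-checks its predicate
    }
    void wait_for(State s) {
        boost::unique_lock<boost::mutex> lk(m);
        while (state != s)      // loop guards against spurious wakeups
            cv.wait(lk);
    }
};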
On my 64-bit Ubuntu box, the following:
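(Presumably a sizeof test on the raw pthreads types; the original listing was along these lines.)

#include <pthread.h>
#include <iostream>

int main() {
    std::cout << sizeof(pthread_mutex_t) << std::endl;
    std::cout << sizeof(pthread_cond_t) << std::endl;
}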
prints:
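40
48

(Sizes on x86-64 glibc; note that 48 + 40 = 88, matching boost::condition_variable above.)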
This indicates that your claim that these primitives "should be 4 or 8 bytes wide depending on the memory model of the CPU" is quite simply not true.
In case you're wondering where the extra 40 bytes taken by boost::condition_variable come from: the Boost class uses an internal mutex.

In a nutshell, on this platform boost::mutex has exactly zero overhead compared to pthread_mutex_t, and boost::condition_variable has the overhead of the extra internal mutex. Whether or not the latter is acceptable for your application is for you to decide.

P.S. I would encourage you to stick to the facts and avoid inflammatory language in your posts. I for one very nearly decided to ignore your post purely because of its tone.
Looking at the implementation:
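(Presumably Boost's pthread backend, boost/thread/pthread/mutex.hpp, abridged; exact details vary between Boost versions.)

class mutex : boost::noncopyable
{
private:
    pthread_mutex_t m;   // the only data member
public:
    mutex()
    {
        int const res = pthread_mutex_init(&m, NULL);
        if (res)
            boost::throw_exception(thread_resource_error());
    }
    ~mutex()        { BOOST_VERIFY(!pthread_mutex_destroy(&m)); }
    void lock()     { BOOST_VERIFY(!pthread_mutex_lock(&m)); }
    void unlock()   { BOOST_VERIFY(!pthread_mutex_unlock(&m)); }
    bool try_lock() { return !pthread_mutex_trylock(&m); }

    typedef pthread_mutex_t* native_handle_type;
    native_handle_type native_handle() { return &m; }
};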
Now, let me strip the non-data parts and reorder:
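(With the member functions stripped, what remains is just:)

class mutex : boost::noncopyable
{
    pthread_mutex_t m;   // all that is left: the raw pthreads mutex
};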
So apart from noncopyable, I see not much overhead that doesn't also occur with system mutexes.
Sorry to comment on this here, but I don't have enough reputation to add a comment.

@FrankH, cache thrashing is not a good justification for making a data structure bigger. Some cache lines are as large as 128 bytes; that doesn't mean a mutex must be that big.

I think programmers must be warned to keep synchronization objects separated in memory so they don't share the same cache line. That can be achieved by embedding the object in a sufficiently large data structure, without bloating the synchronization object itself with unused bytes. On the other hand, inserting unused bytes can degrade program speed, because the CPU has to fetch more cache lines to access the same structure.

@Hassan Syed,

I don't think mutexes were designed with this type of cache optimization in mind. Rather, I think they are designed the way they are to support things like priority inheritance, nested locks, and so on. As a suggestion, if you need a lot of mutexes in your program, consider something like a pool (array) of mutexes and store just an index in your nodes (taking care of memory separation, of course). I leave the details of this solution for you to think through; a sketch follows below.
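(A minimal sketch of that suggestion, assuming a 64-byte cache line and C++17 for over-aligned heap allocation; MutexPool, PaddedMutex, and Node are illustrative names.)

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// One (assumed) 64-byte cache line per mutex, so pool entries never
// false-share with each other.
struct alignas(64) PaddedMutex {
    std::mutex m;
};

class MutexPool {
    std::vector<PaddedMutex> pool_;
public:
    explicit MutexPool(std::size_t n) : pool_(n) {}
    std::mutex& get(std::uint32_t index) { return pool_[index % pool_.size()].m; }
};

// Nodes store a 4-byte index instead of a 40-byte mutex each:
struct Node {
    std::uint32_t lock_index;
    // ... payload ...
};

// Usage: std::lock_guard<std::mutex> g(pool.get(node.lock_index));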
For Windows, use Slim Reader/Writer (SRW) Locks. Their size, as far as I remember, is 8 bytes, and they are faster than a regular mutex. I suggested that Microsoft use that primitive inside std::mutex, but they rejected the idea for the sake of ABI portability.

On Linux I would suggest implementing the mutex class on top of the futex primitive. In that case you need just 4 bytes. Unfortunately, I do not have a proper implementation of a futex-based mutex class at the moment. A trivial implementation of the unlock() method would use syscall(FUTEX_WAKE), but that is not a performant way to do it. AFAIK Ulrich Drepper has one (in his paper "Futexes Are Tricky"); a rough sketch of that approach follows at the end of this answer.

Yet another way is to spin on a std::atomic_bool and call the std::this_thread::yield() function. The drawback is 100% CPU core utilization.
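(A minimal Drepper-style sketch of such a futex mutex, Linux-only and not production-ready: no spin phase, no error handling, default seq_cst memory ordering throughout. The class name and helper are mine, not Drepper's code.)

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class futex_mutex {
    std::atomic<int> state_;   // 0 = unlocked, 1 = locked, 2 = locked with waiters

    static void futex_op(std::atomic<int>* uaddr, int op, int val) {
        syscall(SYS_futex, reinterpret_cast<int*>(uaddr), op, val,
                nullptr, nullptr, 0);
    }
public:
    futex_mutex() : state_(0) {}

    void lock() {
        int c = 0;
        if (state_.compare_exchange_strong(c, 1))
            return;                                    // uncontended fast path
        if (c != 2)
            c = state_.exchange(2);                    // announce contention
        while (c != 0) {                               // lock still held
            futex_op(&state_, FUTEX_WAIT_PRIVATE, 2);  // sleep while state_ == 2
            c = state_.exchange(2);
        }
    }
    void unlock() {
        if (state_.exchange(0) == 2)                   // possibly someone waiting?
            futex_op(&state_, FUTEX_WAKE_PRIVATE, 1);  // wake one waiter
    }
};

static_assert(sizeof(futex_mutex) == 4, "4 bytes, as advertised");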
As @NPE says, boost::mutex (and std::mutex) in libstdc++ on GNU/Linux are zero-overhead wrappers for pthread_mutex_t, which is a fairly bloated 40 bytes on x86-64, or 24 bytes on i386. (Or 32 bytes on x32, the 64-bit mode with 32-bit longs/pointers.)

The library internals use struct __pthread_mutex_s, defined in glibc's x86/nptl/bits/struct_mutex.h (https://codebrowser.dev/glibc/glibc/sysdeps/x86/nptl/bits/struct_mutex.h.html):
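(Lightly abridged from the linked header; see the link for the exact current definition.)

struct __pthread_mutex_s
{
  int __lock;
  unsigned int __count;
  int __owner;
#ifdef __x86_64__
  unsigned int __nusers;
#endif
  /* KIND must stay at this position in the structure to maintain
     binary compatibility with static initializers.  */
  int __kind;
#ifdef __x86_64__
  short __spins;
  short __elision;
  __pthread_list_t __list;        /* two pointers */
#else
  unsigned int __nusers;
  __extension__ union
  {
    struct
    {
      short __espins;
      short __eelision;
    } __elision_data;
    __pthread_slist_t __list;
  };
#endif
};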
I assume __pthread_list_t includes 2 pointers, hence the size difference between x86-64 and x32. I haven't dug further into what all these fields are used for, but in case anyone's curious, that's what's there.

Elision is presumably for hardware features like Intel's TSX transactional memory, which used to include hardware lock elision (HLE), where a prefix on an xchg or lock cmpxchg or similar instruction, plus one on a mov store or another RMW to unlock, could make the critical section an atomic transaction without actually contending for access to the mutex itself. So coarse locking could get the benefit of concurrency if separate threads weren't actually touching the same cache lines.

HLE has since been dropped, due to some kind of security bug I think, leaving only RTM (Restricted Transactional Memory), which can still be enabled on some CPUs and OSes. RTM uses special instructions like xbegin/xend, so it can't transparently turn a mutex lock/unlock into a transaction even if you wanted that. (HLE wasn't always a win, since abort/retry can be less good than just waiting for ownership of the lock.) Anyway, with HLE dead unless you're using old microcode, that field is presumably a waste of space on x86-64, so maybe it will disappear if there's ever a new ABI version of libpthread.
I don't know what the other fields are for. Hopefully there's some significant benefit, but I suspect there might not be for the common case of simple mutual exclusion. Maybe there's a debugging benefit in detecting an unlock by a thread that doesn't own the mutex.
The visible definition of pthread_mutex_t is a union with the right alignment and a char [] member of the right size, as defined in https://codebrowser.dev/glibc/glibc/sysdeps/x86/nptl/bits/pthreadtypes-arch.h.html, probably auto-generated from sizeof(struct __pthread_mutex_s) since the size matches. Roughly:
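(Again a lightly abridged reproduction of glibc's definition.)

typedef union
{
  struct __pthread_mutex_s __data;
  char __size[__SIZEOF_PTHREAD_MUTEX_T];   /* 40 on x86-64 */
  long int __align;
} pthread_mutex_t;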
There aren't any extra bytes just to make it larger so that arrays of mutexes would have fewer per cache line, as FrankH's answer hypothesizes; most uses of mutexes have one mixed in with the other data it protects, so a larger mutex would just take up valuable space.

I don't understand the purpose of making this a union with the array and align members. The definition of struct __pthread_mutex_s is still visible to code that includes pthread.h (Godbolt). Using alignas(long int) on the first member of __pthread_mutex_s would have gotten the same layout, at least in sane ABIs like System V, and hopefully also Windows. Or __attribute__((aligned(__alignof(long)))) for pre-C++11 code, like when the NPTL pthreads implementation was new. (__alignof() is a GNU C builtin that predates C _Alignof and C++ alignof.)