boost::thread data structure sizes on the ridiculous side?

Posted on 2024-11-26 07:45:52

Compiler: clang++ x86-64 on linux.

It has been a while since I have written any intricate low-level system code, and I usually program against the system primitives (Windows and pthreads/POSIX). So the ins and outs have slipped from my memory. I am working with boost::asio and boost::thread at the moment.

In order to emulate synchronous RPC against an asynchronous function executor (a boost::io_service with multiple threads calling io_service::run, where requests are submitted via io_service::post), I am using Boost synchronization primitives. For curiosity's sake I decided to sizeof the primitives. This is what I get to see.

struct notification_object
{
  bool ready;
  boost::mutex m;
  boost::condition_variable v;
};
...
std::cout << sizeof(bool) << std::endl;
std::cout << sizeof(boost::mutex) << std::endl;
std::cout << sizeof(boost::condition_variable) << std::endl;
std::cout << sizeof(notification_object) << std::endl;
...

Output:

1
40
88
136

Forty bytes for a mutex?! WTF! 88 for a condition_variable!!! Please keep in mind that I'm repulsed by this bloated size because I am thinking of an application that could create hundreds of notification_objects.

This level of overhead for portability seems ridiculous. Can someone justify this? As far as I can remember, these primitives should be 4 or 8 bytes wide, depending on the memory model of the CPU.

Comments (6)

Smile简单爱 2024-12-03 07:45:52

When you look at "size overhead" for any type of synchronization primitive, keep in mind that these cannot be packed too closely. Two mutexes sharing a cacheline, for example, would end up causing cache thrashing (false sharing) if they're in use concurrently, even if the users acquiring these locks never "conflict". I.e. imagine two threads running two loops:

for (;;) {
    lock(lockA);
    unlock(lockA);
}

and

for (;;) {
    lock(lockB);
    unlock(lockB);
}

You will see twice the number of iterations when run on two different threads, compared to one thread running one loop, if and only if the two locks are not within the same cacheline. If lockA and lockB are in the same cacheline, the number of iterations per thread will halve, because the cacheline holding those two locks will permanently bounce between the CPU cores executing these two threads.
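A minimal sketch of how the two locks could be kept on separate cachelines, assuming a 64-byte line (typical on x86-64, but an assumption here) and using std::mutex in place of the lock()/unlock() pseudocode. The names PaddedLock, kCacheLine, cache_line_of and on_separate_lines are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Assumed cache-line size; 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Hypothetical wrapper: alignas rounds both the alignment and the size of the
// struct up to a multiple of the cache line, so neighbours in an array never
// share a line.
struct alignas(kCacheLine) PaddedLock {
    std::mutex m;
};

// Index of the cache line that an object starts in.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True if two objects start in different cache lines.
inline bool on_separate_lines(const void* a, const void* b) {
    return cache_line_of(a) != cache_line_of(b);
}
```

With an array such as PaddedLock locks[2], one thread can hammer locks[0].m and another locks[1].m without the line bouncing between cores. The cost is at least 64 bytes per lock instead of the bare mutex size.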

Hence even though the actual data size of the primitive data type underlying a spinlock or mutex might only be a byte or a 32bit word, the effective data size of such an object is often larger.

Keep that in mind before asserting "my mutexes are too large". In fact, on x86/x64, 40 Bytes is too small to prevent false sharing, as cachelines there are currently at least 64 Bytes.

Beyond that, if you're highly concerned about memory usage, consider that notification objects need not be unique - condition variables can serve to trigger for different events (via the predicate that boost::condition_variable knows about). It'd therefore be possible to use a single mutex/CV pair for a whole state machine instead of one such pair per state. Same goes for e.g. thread pool synchronization - having more locks than threads is not necessarily beneficial.
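As a rough sketch of sharing one mutex/CV pair between several events: the class below keeps a bitmask of "ready" flags behind a single pair, and each waiter passes its own predicate. It uses std::mutex and std::condition_variable rather than the Boost types so that it is self-contained; EventSet and its member names are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical helper: one mutex/condition-variable pair guards a bitmask of
// up to 64 independent "ready" flags instead of one pair per flag.
class EventSet {
public:
    // Mark event `id` (0..63) as ready and wake all waiters so that each one
    // can re-check its own predicate.
    void signal(unsigned id) {
        {
            std::lock_guard<std::mutex> guard(m_);
            ready_ |= (std::uint64_t{1} << id);
        }
        cv_.notify_all();
    }

    // Block until event `id` is ready, then consume it.
    void wait(unsigned id) {
        std::unique_lock<std::mutex> lock(m_);
        const std::uint64_t bit = std::uint64_t{1} << id;
        cv_.wait(lock, [&] { return (ready_ & bit) != 0; });
        ready_ &= ~bit;
    }

    // Non-blocking check, mainly useful for tests.
    bool is_ready(unsigned id) {
        std::lock_guard<std::mutex> guard(m_);
        return (ready_ & (std::uint64_t{1} << id)) != 0;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::uint64_t ready_ = 0;
};
```

The trade-off is that notify_all wakes every waiter, including those whose event has not fired, so memory is saved at the price of some spurious wake-ups and contention on the shared mutex.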

Edit: For a few more references on "false sharing" (and the negative performance impact caused by hosting multiple atomically-updated variables within the same cacheline), see (amongst others) the following SO postings:

As said, when using multiple "synchronization objects" (whether that'd be atomically-updated variables, locks, semaphores, ...) in a multi-core, cache-per-core config, allow each of them a separate cacheline of space. You're trading memory usage for scalability here, but really, if you get into the region where your software needs several millions of locks (making that GBs of mem), you either have the funding for a few hundred GB of memory (and a hundred CPU cores), or you're doing something wrong in your software design.

In most cases (a lock / an atomic for a specific instance of a class / struct), you get the "padding" for free as long as the object instance that contains the atomic variable is large enough.

故事未完 2024-12-03 07:45:52

On my 64-bit Ubuntu box, the following:

#include <pthread.h>
#include <stdio.h>

int main() {
  /* sizeof yields a size_t, so print it with %zu rather than %ld */
  printf("sizeof(pthread_mutex_t)=%zu\n", sizeof(pthread_mutex_t));
  printf("sizeof(pthread_cond_t)=%zu\n", sizeof(pthread_cond_t));
  return 0;
}

prints

sizeof(pthread_mutex_t)=40
sizeof(pthread_cond_t)=48

This indicates that your claim that

This level of overhead for portability seems ridiculous, can someone
justify this to me ? as far as I can remember these primitives should
be 4 or 8 bytes wide depending on the memory model of the CPU.

is quite simply not true.

In case you're wondering where the extra 40 bytes taken by boost::condition_variable come from, the Boost class uses an internal mutex.

In a nutshell, on this platform boost::mutex has exactly zero overhead compared to pthread_mutex_t, and boost::condition_variable has the overhead of the extra internal mutex. Whether or not the latter is acceptable for your application is for you to decide.
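To make the arithmetic concrete, here is a sketch that rebuilds the layout from raw pthread types. CondWithInternalMutex is a hypothetical stand-in for boost::condition_variable based only on the description above (native condition variable plus one internal mutex), not a copy of Boost's source. On x86-64 glibc, where the sizes printed above apply, it should come to 40 + 48 = 88 bytes, and the whole struct to 1 + 7 (padding) + 40 + 88 = 136 bytes, matching the question's output.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Hypothetical stand-in for boost::condition_variable as described above:
// the native condition variable plus one extra internal mutex.
struct CondWithInternalMutex {
    pthread_mutex_t internal_mutex;
    pthread_cond_t cond;
};

// Same shape as the notification_object from the question.
struct NotificationLike {
    bool ready;
    pthread_mutex_t m;
    CondWithInternalMutex v;
};

// Bytes the compiler inserts after `ready` so that `m` is properly aligned.
constexpr std::size_t padding_after_ready() {
    return offsetof(NotificationLike, m) - sizeof(bool);
}
```

Note that the bool alone costs a full alignment unit because of the padding, so a bare 1-byte flag next to a mutex is rarely as cheap as sizeof(bool) suggests.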

P.S. I would encourage you to stick to the facts and avoid using inflammatory language in your posts. I for one very nearly decided to ignore your post purely for its tone.

他夏了夏天 2024-12-03 07:45:52

Looking at the implementation:

class mutex : private noncopyable
{
public:
    friend class detail::thread::lock_ops<mutex>;

    typedef detail::thread::scoped_lock<mutex> scoped_lock;

    mutex();
    ~mutex();

private:
#if defined(BOOST_HAS_WINTHREADS)
    typedef void* cv_state;
#elif defined(BOOST_HAS_PTHREADS)
    struct cv_state
    {
        pthread_mutex_t* pmutex;
    };
#elif defined(BOOST_HAS_MPTASKS)
    struct cv_state
    {
    };
#endif
    void do_lock();
    void do_unlock();
    void do_lock(cv_state& state);
    void do_unlock(cv_state& state);

#if defined(BOOST_HAS_WINTHREADS)
    void* m_mutex;
#elif defined(BOOST_HAS_PTHREADS)
    pthread_mutex_t m_mutex;
#elif defined(BOOST_HAS_MPTASKS)
    threads::mac::detail::scoped_critical_region m_mutex;
    threads::mac::detail::scoped_critical_region m_mutex_mutex;
#endif
};

Now, let me strip the non-data parts and reorder:

class mutex : private noncopyable {
private:
#if defined(BOOST_HAS_WINTHREADS)
    void* m_mutex;
#elif defined(BOOST_HAS_PTHREADS)
    pthread_mutex_t m_mutex;
#elif defined(BOOST_HAS_MPTASKS)
    threads::mac::detail::scoped_critical_region m_mutex;
    threads::mac::detail::scoped_critical_region m_mutex_mutex;
#endif
};

So apart from noncopyable, I don't see much overhead beyond what a system mutex already incurs.

友欢 2024-12-03 07:45:52

Sorry for posting this as an answer, but I don't have enough reputation to add a comment.

@FrankH, cache thrashing is not a good justification for making a data structure bigger. Some cache lines are even 128 bytes in size; that doesn't mean a mutex must be that big.

I think programmers must be warned to separate synchronization objects in memory so they don't share the same cache line. This can be achieved by embedding the object in a big enough data structure, without bloating the data structure with unused bytes. On the other hand, inserting unused bytes can hurt program speed, because the CPU has to fetch more cache lines to access the same structure.

@Hassan Syed,
I don't think that mutexes were designed with this type of cache optimization in mind. Instead, I think they are this size in order to support things like priority inheritance, nested locks, and so on. As a suggestion, if you need a lot of mutexes in your program, consider something like a pool (array) of mutexes and storing just an index in your nodes (taking care of memory separation, of course). I'll leave the details of this solution to you.
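One possible shape for such a pool, as a sketch only: a fixed array of mutexes, each padded to an assumed 64-byte cache line, with nodes storing a one-byte index. MutexPool, Node and add_to are hypothetical names, and std::mutex stands in for whichever mutex type is actually in use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical fixed-size pool: many nodes share a small array of mutexes,
// and each node stores only a one-byte index into that array.
template <std::size_t N>
class MutexPool {
    static_assert(N > 0 && N <= 256, "index must fit in one byte");
public:
    // Hand out slots round-robin. Not thread-safe; call during setup.
    std::uint8_t assign() { return static_cast<std::uint8_t>(next_++ % N); }

    std::mutex& at(std::uint8_t index) { return slots_[index % N].m; }

private:
    // Each slot is padded to an assumed 64-byte cache line (memory separation).
    struct alignas(64) Slot { std::mutex m; };
    std::array<Slot, N> slots_;
    std::size_t next_ = 0;
};

// Example node: one byte of locking overhead instead of an embedded mutex.
struct Node {
    int value = 0;
    std::uint8_t lock_index = 0;
};

// Lock the node's shared mutex for the duration of the update.
template <std::size_t N>
void add_to(MutexPool<N>& pool, Node& node, int delta) {
    std::lock_guard<std::mutex> guard(pool.at(node.lock_index));
    node.value += delta;
}
```

Nodes that map to the same slot contend with each other, and code that holds the locks of two nodes at once has to handle the case where both map to the same slot, otherwise it deadlocks on itself.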

吻泪 2024-12-03 07:45:52

For Windows, use Slim Reader/Writer (SRW) Locks. As far as I remember, an SRW lock is 8 bytes and is faster than a regular mutex. I suggested that Microsoft use that primitive inside std::mutex, but they rejected the idea for the sake of ABI compatibility.

On Linux I would suggest implementing the mutex class on top of the futex primitive. In that case you would need just 4 bytes. Unfortunately I do not have a proper implementation of a futex-based mutex class at the moment. A trivial implementation of the unlock() method would always call syscall(FUTEX_WAKE), but that does not perform well. AFAIK Ulrich Drepper has one.
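For illustration, a Linux-only sketch of a 4-byte futex mutex using the common three-state scheme (0 = unlocked, 1 = locked, 2 = locked with possible waiters), along the lines of what Ulrich Drepper describes in "Futexes Are Tricky". It is not the author's implementation and not production code; FutexMutex is a made-up name, and passing the address of a std::atomic<int> to the futex system call is an assumption that holds on mainstream Linux toolchains.

```cpp
#include <atomic>
#include <cassert>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Linux-only sketch of a three-state futex mutex:
// 0 = unlocked, 1 = locked, 2 = locked and there may be waiters.
class FutexMutex {
public:
    void lock() {
        int c = 0;
        // Fast path: uncontended 0 -> 1 transition, no system call.
        if (state_.compare_exchange_strong(c, 1, std::memory_order_acquire))
            return;
        // Slow path: advertise contention, then sleep until the lock is free.
        if (c != 2)
            c = state_.exchange(2, std::memory_order_acquire);
        while (c != 0) {
            futex(FUTEX_WAIT_PRIVATE, 2);  // sleeps only while state_ == 2
            c = state_.exchange(2, std::memory_order_acquire);
        }
    }

    bool try_lock() {
        int expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire);
    }

    void unlock() {
        // Only pay for FUTEX_WAKE if someone may be waiting (state was 2).
        if (state_.exchange(0, std::memory_order_release) == 2)
            futex(FUTEX_WAKE_PRIVATE, 1);
    }

private:
    long futex(int op, int value) {
        // Assumes std::atomic<int> has the same representation as int.
        return syscall(SYS_futex, reinterpret_cast<int*>(&state_), op, value,
                       nullptr, nullptr, 0);
    }

    std::atomic<int> state_{0};
};

static_assert(sizeof(FutexMutex) == 4, "the whole mutex is one 32-bit word");
```

Because unlock() only issues FUTEX_WAKE when the state was 2, the uncontended lock/unlock pair stays entirely in user space. The class provides lock(), try_lock() and unlock(), so it can be used with std::lock_guard.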

Yet another way is to spin on std::atomic_bool and call the std::this_thread::yield() function. The drawback is 100% CPU core utilization.
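A minimal sketch of that spin-and-yield approach, with SpinLock as a hypothetical name. It has no fairness and no backoff beyond the yield, so it only illustrates the size argument.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Sketch of the spin-and-yield lock described above.
class SpinLock {
public:
    void lock() {
        // exchange returns the previous value; true means someone holds it.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // give up the time slice, then retry
        }
    }

    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic_bool locked_{false};
};

static_assert(sizeof(SpinLock) == sizeof(std::atomic_bool),
              "no state beyond the single flag");
```

A waiting thread never sleeps in the kernel here, which is where the 100% core utilization comes from.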

一百个冬季 2024-12-03 07:45:52

As @NPE says, boost::mutex (and std::mutex) in libstdc++ on GNU/Linux are zero-overhead wrappers for pthread_mutex_t, which is a fairly bloated 40 bytes on x86-64, or 24 bytes on i386. (Or 32 bytes on x32; 64-bit mode with 32-bit long / pointers.)

The library internals use struct __pthread_mutex_s, defined in glibc's x86/nptl/bits/struct_mutex.h: (https://codebrowser.dev/glibc/glibc/sysdeps/x86/nptl/bits/struct_mutex.h.html)

// bits/struct_mutex.h
struct __pthread_mutex_s
{
  int __lock;
  unsigned int __count;
  int __owner;
#ifdef __x86_64__
  unsigned int __nusers;
#endif
  /* KIND must stay at this position in the structure to maintain
     binary compatibility with static initializers.  */
  int __kind;
#ifdef __x86_64__
  short __spins;
  short __elision;
  __pthread_list_t __list;
# define __PTHREAD_MUTEX_HAVE_PREV      1
#else
    // non-x86-64 side of the ifdef
  unsigned int __nusers;      // this member seems common to both sides
  __extension__ union
  {
    struct
    {
      short __espins;
      short __eelision;
# define __spins __elision_data.__espins
# define __elision __elision_data.__eelision
    } __elision_data;
    __pthread_slist_t __list;
  };
# define __PTHREAD_MUTEX_HAVE_PREV      0
#endif       // __x86_64__
};

I assume __pthread_list_t includes 2 pointers, hence the size difference between x86-64 and x32. I haven't dug further into what all these fields are used for, but if anyone's curious, that's what's there.
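The size arithmetic can be checked with a mirror of the x86-64 branch of the struct. This is a sketch with made-up names and fixed-width types, not glibc's definition, and it bakes in the assumption above that the list node is two pointers. With 8-byte pointers it comes to 24 + 16 = 40 bytes, with 4-byte pointers to 24 + 8 = 32 bytes, consistent with the x86-64 and x32 figures quoted earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the x86-64 branch of __pthread_mutex_s, with
// made-up names. The list node is assumed to be two pointers (prev/next).
struct ListNodeMirror {
    void* prev;
    void* next;
};

struct MutexMirror {
    std::int32_t lock;
    std::uint32_t count;
    std::int32_t owner;
    std::uint32_t nusers;
    std::int32_t kind;
    std::int16_t spins;
    std::int16_t elision;
    ListNodeMirror list;
};

// Five 4-byte fields plus two 2-byte fields: 24 bytes before the list node.
constexpr std::size_t kScalarBytes = 5 * 4 + 2 * 2;

static_assert(offsetof(MutexMirror, list) == kScalarBytes,
              "no padding before the list node");
static_assert(sizeof(MutexMirror) == kScalarBytes + 2 * sizeof(void*),
              "24 bytes of scalars plus two pointers");
```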

Elision is presumably for hardware features like Intel's TSX transactional memory, which used to include hardware lock elision (HLE) where a prefix on an xchg or lock cmpxchg or similar instruction, plus on a mov store or another RMW to unlock, could make the critical section an atomic transaction without actually contending for access to the mutex itself. So coarse locking could get the benefit of concurrency if separate threads weren't actually touching the same cache lines.

HLE has since been dropped, due to some kind of security bug I think, leaving only RTM (Restricted Transactional Memory), which can still be enabled on some CPUs and OSes. RTM uses special instructions like xbegin / xend, so it can't transparently turn a mutex lock/unlock into a transaction even if you wanted that. (HLE wasn't always a win, since abort / retry can be worse than just waiting for ownership of the lock.)

Anyway, with HLE dead unless you're using old microcode, that field is presumably a waste of space on x86-64, so maybe will disappear if there's ever a new ABI version of libpthread.

I don't know what the other fields are for. Hopefully there's some significant benefit, but I suspect there might not be for the common case of simple mutual exclusion. Maybe there's a debugging benefit in detecting an unlock by a thread that doesn't own the mutex.


The visible definition of pthread_mutex_t is a union with the right alignment and a char [] member of the right size, as defined in https://codebrowser.dev/glibc/glibc/sysdeps/x86/nptl/bits/pthreadtypes-arch.h.html, which is probably auto-generated from sizeof(struct __pthread_mutex_s) since the size matches. There aren't any extra bytes just to make it larger so that arrays of mutexes would have fewer per cache line, as FrankH's answer hypothesizes; most uses of mutexes have one mixed in with the other data it protects, so a larger mutex just takes up valuable space.

// bits/pthreadtypes.h, included from <pthread.h>
typedef union
{
  struct __pthread_mutex_s __data;
  char __size[__SIZEOF_PTHREAD_MUTEX_T];
  long int __align;
} pthread_mutex_t;

I don't understand the purpose of making this a union with the array and align members. The definition of struct __pthread_mutex_s is still visible to code that includes pthread.h (Godbolt). Using alignas(long int) on the first member of __pthread_mutex_s would have gotten the same layout, at least in sane ABIs like System V. And hopefully also Windows.
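A small sketch of that layout claim, using a stand-in payload rather than the real struct __pthread_mutex_s (all names below are hypothetical). It compares the union approach with putting alignas(long) on the first member and checks that both end up with the same size and alignment.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in payload; the real struct __pthread_mutex_s is not used here.
struct Payload {
    int lock;
    unsigned count;
    int owner;
    unsigned nusers;
    int kind;
    short spins;
    short elision;
    void* prev;
    void* next;
};

// Variant 1: the union trick, shaped like the pthread_mutex_t typedef above.
union UnionWrapped {
    Payload data;
    char size[sizeof(Payload)];
    long align;
};

// Variant 2: the same fields, with alignas(long) on the first member instead.
struct AlignasWrapped {
    alignas(long) int lock;
    unsigned count;
    int owner;
    unsigned nusers;
    int kind;
    short spins;
    short elision;
    void* prev;
    void* next;
};

static_assert(sizeof(UnionWrapped) == sizeof(AlignasWrapped), "same size");
static_assert(alignof(UnionWrapped) == alignof(AlignasWrapped),
              "same alignment");
```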

Or __attribute__((aligned(__alignof(long)))) for pre-C++11 code like when the NPTL pthreads implementation was new. __alignof() is a GNU C builtin that predates C _Alignof and C++ alignof.
