C - __declspec(thread) variable performance

Posted on 2024-10-19 03:43:45

I'm working on a multithreaded implementation of a library. One module of this library has some global variables that are used very often during program execution. To make access to those variables safer, I declared them with the thread-local storage (TLS) keyword __declspec(thread).

Here is the call to the library's external function. This function uses the module with the global variables:

for(i = 0; i<n_cores; i++)
    hth[i] = (HANDLE)_beginthread((void(*)(void*))MT_Interface_DimenMultiCells,0,(void*)&inputSet[i]);

In this way I guess all the variables used in the library will be duplicated for each thread.

When I run the program on an 8-core processor, the time required to complete the operation never drops below about 1/3 of the time needed by the single-process implementation.

I know that it is impossible to reach 1/8 of the time, but I thought that at least 1/6 was reachable.

The question is: are those __declspec(thread) variables the cause of such poor performance?
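
For reference, the 1/3 and 1/6 figures above can be related to each other with Amdahl's law. This is only a rough sketch: it assumes the idealized model in which a fixed fraction of the work parallelizes perfectly and the rest stays serial, which the post does not state. The function names are made up for illustration.

```c
/* Amdahl's law: ideal speedup on n_cores when a fraction p (0..1) of the
 * work parallelizes perfectly and the remaining (1 - p) stays serial. */
double amdahl_speedup(double p, int n_cores)
{
    return 1.0 / ((1.0 - p) + p / (double)n_cores);
}

/* Inverse of the formula above: the parallel fraction implied by an
 * observed speedup on n_cores (requires n_cores > 1 and speedup > 0). */
double implied_parallel_fraction(double speedup, int n_cores)
{
    return (1.0 - 1.0 / speedup) * (double)n_cores / (double)(n_cores - 1);
}
```

Under that model, a 3x speedup on 8 cores corresponds to p = 16/21 (about 76% of the work running in parallel), while a 6x speedup would need p = 20/21 (about 95%). A serial portion of roughly a quarter of the run time would therefore be enough to produce the observed result, independent of any TLS cost.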

Comments (2)

婴鹅 2024-10-26 03:43:45

If you declare them as __declspec(thread) where they were previously global, then you have changed the meaning of the program, as well as its performance characteristics.

When the variable was a global, there was a single copy that every thread referred to. As a thread-local, each thread has its own copy, and changes to that copy are visible only in that thread.
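
The difference can be seen in a small sketch. To keep it runnable outside Windows, it uses C11 _Thread_local and POSIX threads as stand-ins for __declspec(thread) and _beginthread; that substitution, and all the names in it, are assumptions made for illustration and not the asker's code. Depending on the toolchain it may need to be built with -pthread.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_DEMO_THREADS 8
#define BUMPS_PER_THREAD 1000

static int shared_counter = 0;               /* global: one copy per process */
static _Thread_local int local_counter = 0;  /* thread-local: one copy per thread */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker bumps both counters, then reports the value it sees in its
 * own thread-local copy through the int that arg points to. */
static void *bump_counters(void *arg)
{
    int *seen_local = arg;
    for (int i = 0; i < BUMPS_PER_THREAD; i++) {
        pthread_mutex_lock(&shared_lock);    /* the shared copy needs a lock */
        shared_counter++;
        pthread_mutex_unlock(&shared_lock);
        local_counter++;                     /* the private copy does not */
    }
    *seen_local = local_counter;
    return NULL;
}

/* Runs n_threads workers (1..MAX_DEMO_THREADS). Returns the final value of
 * the shared counter and fills per_thread[i] with worker i's thread-local
 * counter. Returns -1 if the arguments are invalid or a thread fails to start. */
int run_counter_demo(int n_threads, int per_thread[])
{
    pthread_t tid[MAX_DEMO_THREADS];
    int started = 0;

    if (n_threads < 1 || n_threads > MAX_DEMO_THREADS)
        return -1;
    shared_counter = 0;
    while (started < n_threads &&
           pthread_create(&tid[started], NULL, bump_counters,
                          &per_thread[started]) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == n_threads ? shared_counter : -1;
}
```

Calling run_counter_demo(4, results) leaves the shared counter at the combined total of all four workers, while every entry of results holds only that worker's own increments. Code that relied on threads seeing each other's updates would behave differently after such a change.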

Assuming that you really want thread-local storage, it is true that reading and writing thread-local variables is more expensive than reading and writing normal variables. Whenever you are faced with an operation that takes a long time to perform, the best solution is to stop doing it at all. In this case there are two obvious ways to do so:

  1. Pass the variable around as a parameter so that it resides on the stack. Accessing stack variables is quick.
  2. If you have functions that read and write this variable a lot, then take a copy of it at the start of the function (into a local variable), work on that local variable, and then on return, write it back to the thread local.

Of these options the former is usually to be preferred. Option 2 has the big weakness that it can't easily be applied if the function calls another function that uses this variable.
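
Option 2 might look roughly like the following. The variable tls_sum is a hypothetical stand-in for one of the library's thread-local globals, and the THREAD_LOCAL macro is only there so the sketch also builds with compilers other than MSVC.

```c
#include <stddef.h>

#if defined(_MSC_VER)
#define THREAD_LOCAL __declspec(thread)
#else
#define THREAD_LOCAL _Thread_local   /* C11 equivalent, assumed for portability */
#endif

/* Hypothetical per-thread accumulator standing in for a library global. */
static THREAD_LOCAL double tls_sum = 0.0;

/* Baseline: the thread-local is read and written on every iteration. */
void accumulate_direct(const double *values, size_t count)
{
    for (size_t i = 0; i < count; i++)
        tls_sum += values[i];
}

/* Option 2: read the thread-local once, work on a plain local variable,
 * and write the result back once on the way out. */
void accumulate_cached(const double *values, size_t count)
{
    double sum = tls_sum;
    for (size_t i = 0; i < count; i++)
        sum += values[i];
    tls_sum = sum;
}

/* Small accessors so callers can inspect and reset the per-thread state. */
double current_sum(void) { return tls_sum; }
void reset_sum(void) { tls_sum = 0.0; }
```

Both functions leave the same value in tls_sum. An optimizing compiler may already perform this transformation on its own in simple loops, so whether the manual version is faster is something to measure. The weakness mentioned above also shows here: if the loop body called another function that reads tls_sum, that function would see a stale value until the write-back.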

Option 1 basically amounts to not using global variables (thread locals are a form of global).
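
A minimal sketch of Option 1, assuming the former globals can be gathered into one struct. CellContext and its fields are invented for illustration; the real library state is not shown in the question.

```c
#include <stddef.h>

/* Hypothetical bundle of the state that used to live in globals. */
typedef struct {
    double sum;
    size_t samples;
} CellContext;

/* Puts a context into its empty starting state. */
void cell_context_init(CellContext *ctx)
{
    ctx->sum = 0.0;
    ctx->samples = 0;
}

/* Every function receives the state it works on explicitly. */
void cell_add_sample(CellContext *ctx, double value)
{
    ctx->sum += value;
    ctx->samples++;
}

/* Returns the mean of the samples added so far, or 0.0 if there are none. */
double cell_mean(const CellContext *ctx)
{
    return ctx->samples ? ctx->sum / (double)ctx->samples : 0.0;
}
```

Each thread would then own its CellContext, for example as a local variable in the thread function or as a member of the per-thread argument that is already passed to it, so no TLS lookup and no sharing are involved.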

This all may be completely wide of the mark of course, because you have said so little about what your code is actually doing. If you want to solve a performance problem, you first have to identify where it is, and that means you need to measure.

桜花祭 2024-10-26 03:43:45

And the answer is: you need to profile the application, and measure where the most time is being spent. If it turns out to be in functions that often reference the TLS data, then "maybe" could be the answer.

It's generally very hard to pick out the reasons for bad performance even in code you've written yourself: doing it remotely in a program described in two short paragraphs is even harder.

Profile, then optimize.
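
Short of a full profiler, a first measurement can be as simple as timing a suspect region with a wall clock. The sketch below uses C11 timespec_get; the helper names and the spin_work workload are hypothetical and exist only to make the example complete.

```c
#include <time.h>

/* Current wall-clock time in seconds, or -1.0 if the clock is unavailable. */
double wall_seconds(void)
{
    struct timespec ts;
    if (timespec_get(&ts, TIME_UTC) != TIME_UTC)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Elapsed wall-clock seconds for one call of work(arg), or -1.0 if the
 * clock could not be read. */
double time_call_seconds(void (*work)(void *), void *arg)
{
    double start = wall_seconds();
    work(arg);
    double end = wall_seconds();
    if (start < 0.0 || end < 0.0)
        return -1.0;
    return end - start;
}

/* Hypothetical workload: sums 0 .. iterations-1 into result. */
typedef struct {
    long iterations;
    long result;
} SpinJob;

void spin_work(void *arg)
{
    SpinJob *job = arg;
    volatile long acc = 0;   /* volatile so the loop is not optimized away */
    for (long i = 0; i < job->iterations; i++)
        acc += i;
    job->result = acc;
}
```

Timing the same region in the single-threaded and multithreaded builds shows where the time goes before any change to the TLS variables is made. TIME_UTC is not a monotonic clock, so a system clock adjustment during the run can distort a reading; repeating the measurement several times helps.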
