Can I easily write a program that makes use of Intel's quad-core or i7 chip if I only use one thread?
I wonder, if my program has only one thread, can I write it so that a quad-core or i7 can actually make use of the different cores? Usually when I write programs on a quad-core computer, the CPU usage only goes to about 25%, and the work seems to be divided among the four cores, as Task Manager shows. (The programs I write are usually in Ruby, Python, or PHP, so they may not be that optimized.)
Update: what if I write it in C or C++ instead, like
for (i = 0; i < 100000000; i++) {
    a = i * 2;
    b = i + 1;
    if (a == ... || b == ...) { ... }
}
and then use the compiler's highest optimization level? Can the compiler make the multiplication happen on one core and the addition happen on a different core, so that two cores work at the same time? Isn't using two cores a fairly easy optimization?
Implicit parallelism is probably what you are looking for.
If your application code is single-threaded, multiple processors/cores will only be used if:
- the libraries it uses spawn threads of their own (perhaps hiding that use behind a simple interface), or
- the application spawns other processes to perform part of its operation.
Ruby, Python, and PHP applications can all be written to use multiple threads, however.
A single-threaded program will only use one core. The operating system may well decide to shift the program between cores from time to time, according to some rules for balancing the load and so on. So you will see only 25% usage overall, with all four cores taking turns, but only one of them working at any given moment.
The only way to use multiple cores without using multithreading is to use multiple programs.
In your example above, one program could handle 0-2499999, the next 2500000-4999999, and so on. Set all four of them off at the same time, and they will use all four cores.
Usually you would be better off writing a (single) multithreaded program.
With C/C++ you can use OpenMP. It's C code with pragmas like

#pragma omp parallel for

to say that this for loop will run in parallel.

This is one easy way to parallelize something, but at some point you will have to understand how parallel programs execute, and you will be exposed to parallel-programming bugs.
If you want to parallelize the choice of the "i"s for which your statement

if (a == ... || b == ...)

evaluates to true, then you can do this with PLINQ (in .NET 4.0). If, instead, you want to parallelize the operations themselves, PLINQ lets you do that as well.
Since you are talking about Task Manager, you appear to be running on Windows. However, if you are running a webserver there with multiple processes (for Ruby or PHP with fcgi or Apache pre-forking, and to a lesser extent other Apache workers), then they would tend to spread out across the cores.
If only a single program without threading is running, then no, no significant advantage will come from that - you're only running one thing at a time, apart from OS-driven background processes.
No. You need to use threads to execute multiple paths concurrently on multiple CPUs (be they real or virtual)... execution of one thread is inherently bound to one CPU, as this maintains the "happens before" relationship between statements, which is central to how programs work.
First, unless multiple threads are created in the program, there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100% while the other cores sit idle. If all cores were used, then it would be theoretically possible for the process to hog 100% of the CPU resources.
As a side note, the graphs shown in Task Manager on Windows are the CPU utilization of all processes running at the time, not only of one process.
Secondly, the code you present could be split into code which can execute on two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other, and that they only depend on i. In that situation, separating the inside of the for loop so that each thread handles one of the two assignments could allow multi-threaded operation and could lead to increased performance.

However, what becomes tricky is if the results from the two separate threads need to be evaluated together, as the if statement later on seems to imply. That would require the a and b values, which reside in separate threads (executing on separate processors), to be looked up, which is a serious headache. There is no real guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition will probably take different amounts of time to execute), which means one thread may need to wait for the other for the i values to get in sync before comparing the a and b that correspond to the dependent value i. Or do we make a third thread for value comparison and synchronization of the two threads? In either case the complexity starts to build up very quickly, so I think we can agree that a serious mess is arising -- sharing state between threads can be very tricky.

Therefore, the code example you provide is only partially parallelizable without much effort; however, as soon as the two variables need to be compared, separating the two operations becomes very difficult very quickly.
A couple of rules of thumb when it comes to concurrent programming:
When there are tasks which can be broken down into parts which involve processing of data that is completely independent of other data and its results (states), then parallelizing can be very easy.
For example, consider two functions which each calculate a value from an input. The two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as there is no state to share or handle between calculations, even if there were multiple values of x that needed to be calculated, the work could be split up further: in that case we could have 8 separate threads performing calculations. Not having side effects can be a very good thing for concurrent programming.
However, as soon as there is dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases, these types of problems will have to be performed in serial as they await results from other calculations to be returned.
Perhaps the question comes down to: why can't compilers figure out the parts that can be automatically parallelized and perform those optimizations? I'm not an expert on compilers so I can't say, but the article on automatic parallelization at Wikipedia may have some information.
I know Intel chips very well.
Per your code, "if (a == ... || b == ...)" is a barrier; otherwise the processor cores would execute all of the code in parallel, regardless of what kind of optimization the compiler has done. That only requires that the compiler is not a very "stupid" one. It means that the hardware itself has the capability, not the software. So threaded programming or OpenMP is not necessary in such cases, although they do help improve parallel computing. Note that this doesn't mean Hyper-Threading, just normal multi-core processor functionality.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classical example which could be executed in parallel by multi-core/multi-channel-IMC platforms (e.g. the Intel Nehalem family, such as Core i7), with no extra software optimization needed.
Why? 3 reasons.

1. Core i7 has a triple-channel IMC; its bus width is 192 bits, 64 bits per channel, and the memory address space is interleaved among the channels on a per-cache-line basis. The cache-line length is 64 bytes, so basically buffer0 sits on channel 0, buffer1 on channel 1, and buffer2 on channel 2, while buffer[192] is interleaved evenly among the 3 channels, 64 bytes per channel. The IMC supports loading or storing data from or to multiple channels concurrently; that's a multi-channel MC burst with maximum throughput. In the following description I'll only talk about 64 bytes per channel, i.e. BL x8 (burst length 8, 8 x 8 = 64 bytes = one cache line) per channel.

2. buffer0..2 and buffer are contiguous in the memory space (on a specific page, both virtually and physically; stack memory). When the code runs, buffer0, 1, 2, and buffer are loaded/fetched into the processor cache, 6 cache lines in total. So after the execution of the "for(){}" code above starts, accessing memory is not necessary at all, because all the data are in the L3 cache, an uncore part shared by all cores. We won't talk about L1/L2 here. In this case every core could pick up the data and compute on it independently; the only requirement is that the OS supports MP and that task stealing is allowed, i.e. runtime scheduling and affinity sharing.

3. There are no dependencies among buffer0, 1, 2, and buffer, so there are no execution stalls or barriers. For example, executing *(buffer + 64 + i) = *(buffer1 + i) doesn't need to wait for the execution of *(buffer + i) = *(buffer0 + i) to finish.

The most important and difficult point, though, is "task stealing, runtime scheduling, and affinity sharing": for a given task there is only one task execution context, and it should be shared by all cores to perform parallel execution. Anyone who understands this point is among the top experts in the world. I'm looking for such an expert to work with me on my open-source project, responsible for parallel computing and work related to the latest HPC architectures.
Note that in the example code above you could also use some SIMD instructions, such as movntdq/a, which bypass the processor cache and write to memory directly. That is a very good idea too when performing software-level optimization, because accessing memory is extremely expensive: accessing the cache (L1) may take only 1 cycle, but accessing memory took 142 cycles on earlier x86 chips.
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com to know the details.