Parallel vs. serial results
The test results below seem odd to me. I ran parallel and serial loops and compared them against each other. I ran the tests in four shapes: parallel only, serial only, parallel loop first then serial loop, and serial loop first then parallel loop. I code these as P, S, PFSL, and SFPL.
The results are shown in the table below. For the combined PFSL/SFPL runs, the first number is the elapsed time of the first loop and the second number that of the second loop.

| Test type and # | Time in milliseconds |
|---|---|
| until = 4 | |
| P #1 | 9,472 |
| S #1 | 13,459 |
| S #2 | 11,323 |
| P #2 | 8,854 |
| P #3 | 9,253 |
| S #3 | 10,669 |
| until = 5 | |
| PFSL #1 | 1,421 / 8,299 |
| SFPL #1 | 1,708 / 6,280 |
| SFPL #2 | 1,657 / 6,334 |
| PFSL #2 | 1,400 / 8,191 |
| PFSL #3 | 1,443 / 8,488 |
| SFPL #3 | 1,784 / 6,475 |
Can someone explain this? How can the second method consistently take longer? You can find the full code in the linked project. Minimal reproducible code:
Serial method:

```csharp
for (int i = somenum; i >= until; i--)
{
    foreach (var nue in nuelist)
    {
        foreach (var path in nue.pathlist)
        {
            foreach (var conn in nue.connlist)
            {
                Func(conn, path);
            }
        }
    }
}
```
Parallel method:

```csharp
for (int i = somenum; i >= until; i--)
{
    Parallel.ForEach(nuelist, nue =>
    {
        Parallel.ForEach(nue.pathlist, path =>
        {
            Parallel.ForEach(nue.connlist, conn =>
            {
                Func(conn, path);
            });
        });
    });
}
```
Inside the Path class:

```csharp
Nue firstnue;
string name;
List<Conn> Conns;

public void Func(Conn conn, Path path)
{
    List<Conn> list = new() { conn };
    list.AddRange(path.list);
    _ = new Path(list);
}

public Path(List<Conn> conns)
{
    // other things
    Conns = conns;
    Paths.TryAdd(name, this);
    firstnue.pathlist.Add(this);
    /*
    firstnue is another nue that will be
    in the next iteration of the for loop
    */
}

public static ConcurrentDictionary<string, Path> Paths = new();
```
Inside the Nue class:

```csharp
public ConcurrentBag<Path> pathlist;

public Nue()
{
    pathlist = new ConcurrentBag<Path>();
}
```
Inside the Conn class:

```csharp
Nue From;
Nue To;

public Conn(Nue From, Nue To)
{
    this.From = From;
    this.To = To;
}
```
In the Main method:

```csharp
using System.Diagnostics;

Stopwatch watch = new();
watch.Start();

// for serial results, uncomment the lines below
// Serial(somenum: n, until: l);
// watch.Stop();
// TimeSpan s = watch.Elapsed;

// for parallel results, uncomment the lines below
// Parallel(somenum: n, until: l);
// watch.Stop();
// TimeSpan p = watch.Elapsed;

// for SFPL results, uncomment the lines below
// Serial(n, l);
// watch.Stop();
// TimeSpan sf = watch.Elapsed;
// watch.Restart();
// Parallel(n, l);
// TimeSpan pl = watch.Elapsed;

// for PFSL results, uncomment the lines below
// Parallel(n, l);
// watch.Stop();
// TimeSpan pf = watch.Elapsed;
// watch.Restart();
// Serial(n, l);
// TimeSpan sl = watch.Elapsed;
```
My results on large matrix multiplication are very different. On a 6-core machine, a correct `Parallel.For` can result in 9 times better performance: 6 times for the cores and another 3 due to hyperthreading.

The question's code is incomplete and very hard to read. It doesn't seem to be doing anything other than adding items to lists. It's impossible to say what the result numbers mean or why they are the way they are.
The linked project mentions neural networks though, so a meaningful real benchmark would be matrix multiplications. In an MLP, the feed-forward stage is all about matrix multiplications.
`Stopwatch` isn't useful for benchmarking, as execution can be delayed by other programs. Even executing the same method many times and averaging the results isn't enough, as there may be spikes, warmup, and caching effects. That's why almost every benchmark these days uses the BenchmarkDotNet package. BDN runs a benchmark for as long as it has to, until it gathers enough measurements to provide a statistically sound result. The multiplication code was borrowed from this article.
The serial multiplication method is:
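The method body itself was lost in this copy of the answer. As a stand-in, here is a minimal sketch of a conventional triple-loop serial multiplication; the method name and the use of `double[,]` are my assumptions, not necessarily the original article's code:

```csharp
using System;

// Serial triple-loop multiplication: result[i,j] = sum over k of a[i,k] * b[k,j].
// Names and types are assumptions; the original article's code was not preserved.
static double[,] MultiplySerial(double[,] a, double[,] b)
{
    int n = a.GetLength(0), m = a.GetLength(1), p = b.GetLength(1);
    var result = new double[n, p];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
            for (int k = 0; k < m; k++)
                result[i, j] += a[i, k] * b[k, j];
    return result;
}

var a = new double[,] { { 1, 2 }, { 3, 4 } };
var b = new double[,] { { 5, 6 }, { 7, 8 } };
var r = MultiplySerial(a, b);
Console.WriteLine($"{r[0, 0]} {r[0, 1]} {r[1, 0]} {r[1, 1]}"); // 19 22 43 50
```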
A naive parallel version simply replaces the outer loop with `Parallel.For`:
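The parallel code did not survive extraction either; assuming the same triple-loop kernel as above, the naive version can be sketched as (method and variable names are mine):

```csharp
using System;
using System.Threading.Tasks;

// Naive parallelisation: only the outer row loop becomes Parallel.For;
// the body still read-modify-writes the shared result array directly.
// Each row index i is handled by exactly one task, so this is still thread-safe.
static double[,] MultiplyParallelNaive(double[,] a, double[,] b)
{
    int n = a.GetLength(0), m = a.GetLength(1), p = b.GetLength(1);
    var result = new double[n, p];
    Parallel.For(0, n, i =>
    {
        for (int j = 0; j < p; j++)
            for (int k = 0; k < m; k++)
                result[i, j] += a[i, k] * b[k, j];
    });
    return result;
}

var a = new double[,] { { 1, 2 }, { 3, 4 } };
var b = new double[,] { { 5, 6 }, { 7, 8 } };
var r = MultiplyParallelNaive(a, b);
Console.WriteLine(r[1, 1]); // 50
```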
A slight tweak is to use a temporary variable for the inner loop instead of writing directly to the result array:
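A sketch of that tweak, under the same naming assumptions as the blocks above:

```csharp
using System;
using System.Threading.Tasks;

// Accumulate into a local `temp` and write each result cell once,
// instead of read-modify-writing result[i, j] on every k iteration.
static double[,] MultiplyParallelTemp(double[,] a, double[,] b)
{
    int n = a.GetLength(0), m = a.GetLength(1), p = b.GetLength(1);
    var result = new double[n, p];
    Parallel.For(0, n, i =>
    {
        for (int j = 0; j < p; j++)
        {
            double temp = 0;
            for (int k = 0; k < m; k++)
                temp += a[i, k] * b[k, j];
            result[i, j] = temp;
        }
    });
    return result;
}

var r = MultiplyParallelTemp(new double[,] { { 1, 2 }, { 3, 4 } },
                             new double[,] { { 5, 6 }, { 7, 8 } });
Console.WriteLine(r[0, 0]); // 19
```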
Going even further, the row from `matrixA` is copied into a local vector at the start of each parallel iteration, to reduce even further the chance of cache misses and non-local access. To benchmark the matrix multiplications, the following .NET 6 code is used; this is the whole Program.cs file.
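The original Program.cs (a BenchmarkDotNet harness wrapping these methods with `[Benchmark]` attributes) was not preserved in this copy. The row-copy kernel itself can be sketched as follows, with the same assumed names as above:

```csharp
using System;
using System.Threading.Tasks;

// Copy row i of `a` into a local buffer first, so the hot inner loop
// reads from a small, cache-friendly one-dimensional array.
static double[,] MultiplyParallelRowCopy(double[,] a, double[,] b)
{
    int n = a.GetLength(0), m = a.GetLength(1), p = b.GetLength(1);
    var result = new double[n, p];
    Parallel.For(0, n, i =>
    {
        var row = new double[m];
        for (int k = 0; k < m; k++)
            row[k] = a[i, k];
        for (int j = 0; j < p; j++)
        {
            double temp = 0;
            for (int k = 0; k < m; k++)
                temp += row[k] * b[k, j];
            result[i, j] = temp;
        }
    });
    return result;
}

var r = MultiplyParallelRowCopy(new double[,] { { 1, 2 }, { 3, 4 } },
                                new double[,] { { 5, 6 }, { 7, 8 } });
Console.WriteLine(r[1, 0]); // 43
```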
Running this on a 6-core machine shows that the naive parallel method is 4 times faster than the serial method. That's certainly not as fast as it should be. Simply using the `temp` variable, though, resulted in 6x faster execution than the serial method, and copying the row into a local buffer performs 9 times better than the serial version. That's because reading and writing across rows in the inner loop cause a lot of cache misses. Using the `temp` variable eliminates some of these; using a copy of the row reduces cache misses even further and allows the processors to take advantage of hyperthreading.