Fixing the arithmetic error in the distributed version

Posted 2025-01-27 20:24:22


I am inverting a matrix via a Cholesky factorization in a distributed environment, as was discussed here. My code works fine, but in order to test that my distributed project produces correct results, I had to compare it with the serial version. The results are not exactly the same!

For example, the last five cells of the result matrix are:

serial gives:
-250207683.634793 -1353198687.861288 2816966067.598196 -144344843844.616425 323890119928.788757
distributed gives:
-250207683.634692 -1353198687.861386 2816966067.598891 -144344843844.617096 323890119928.788757

I posted in the Intel forum about this, but the answer I got was about getting the same results across all executions of the distributed version, something I already had. They seem (in another thread) to be unable to answer this:

How can I get the same results between serial and distributed execution? Is this possible? That would fix the arithmetic error.

I have tried setting mkl_cbwr_set(MKL_CBWR_AVX); and using mkl_malloc() to align memory, but nothing changed. I only get identical results when I spawn a single process for the distributed version (which makes it almost serial)!
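For anyone trying the same thing, this is roughly what that setup looks like; a minimal sketch using MKL's conditional numerical reproducibility (CNR) API, with the problem size n as a placeholder:

#include <mkl.h>    // mkl_cbwr_set, mkl_malloc, mkl_free
#include <cstdio>

int main()
{
  // Pin MKL to the AVX code path so repeated runs on the same machine
  // are bit-for-bit reproducible. Note: CNR controls run-to-run
  // reproducibility within one configuration; it does not, by itself,
  // make a distributed run match a serial one.
  if (mkl_cbwr_set(MKL_CBWR_AVX) != MKL_CBWR_SUCCESS)
    std::printf("MKL_CBWR_AVX not supported on this CPU\n");

  const int n = 1000;  // placeholder problem size
  // 64-byte aligned allocation, as MKL recommends for its buffers.
  double *a = static_cast<double *>(mkl_malloc(sizeof(double) * n * n, 64));
  // ... fill 'a' and call the factorization/inversion routines ...
  mkl_free(a);
  return 0;
}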

The distributed routines I am calling: pdpotrf() and pdpotri().

The serial routines I am calling: dpotrf() and dpotri().
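For context, the serial path boils down to something like this minimal sketch (using MKL's LAPACKE interface; the 2x2 matrix, layout, and uplo choice are illustrative, not my actual setup):

#include <mkl_lapacke.h>  // LAPACKE_dpotrf, LAPACKE_dpotri
#include <cstdio>

int main()
{
  // A small SPD matrix in column-major order; only the lower
  // triangle is referenced with uplo = 'L'.
  const int n = 2, lda = 2;
  double a[4] = {4.0, 2.0,
                 2.0, 3.0};

  // Cholesky factorization A = L * L^T ...
  int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, a, lda);
  // ... then in-place inversion from the factor; afterwards the
  // lower triangle of a[] holds the lower triangle of inv(A).
  if (info == 0)
    info = LAPACKE_dpotri(LAPACK_COL_MAJOR, 'L', n, a, lda);
  std::printf("info = %d, inv(A)(0,0) = %f\n", info, a[0]);  // 0.375
  return 0;
}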


Comments (2)

沉鱼一梦 2025-02-03 20:24:22


Your differences seem to appear at about the 12th s.f. Since floating-point arithmetic is not truly associative (that is, f-p arithmetic does not guarantee that a+(b+c) == (a+b)+c), and since parallel execution does not, generally, give a deterministic order in which operations are applied, these small differences are typical of parallelised numerical codes when compared to their serial equivalents. Indeed, you may observe the same order of difference when running on a different number of processors, 4 vs 8, say.
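A quick way to see this for yourself (a trivial example, unrelated to your data):

#include <cstdio>

int main()
{
  // The same three terms, grouped two different ways, round
  // differently - exactly what a reordered parallel reduction does.
  double left  = (0.1 + 0.2) + 0.3;  // prints 0.60000000000000009
  double right = 0.1 + (0.2 + 0.3);  // prints 0.59999999999999998
  std::printf("%.17g\n%.17g\nequal: %d\n", left, right, left == right);
  return 0;
}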

Unfortunately, the easy way to get deterministic results is to stick to serial execution. Getting deterministic results from parallel execution requires a major effort to be very specific about the order of execution of operations, right down to the last + or *, which almost certainly rules out the use of most numeric libraries and leads you to painstaking manual coding of large numeric routines.
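To give a flavour of what "being specific about the order" means, here is a sketch of a summation whose reduction tree is fixed by index, so it rounds identically no matter how the work is scheduled (illustrative only; real factorization kernels are far more involved):

#include <cstdio>
#include <vector>

// Pairwise sum with a tree shape fixed by index: every execution,
// however the pairs might be farmed out to workers, combines the same
// operands in the same order, so the rounding is bit-identical.
double fixed_tree_sum(std::vector<double> v)
{
  for (std::size_t stride = 1; stride < v.size(); stride *= 2)
    for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
      v[i] += v[i + stride];  // pairing depends only on indices
  return v.empty() ? 0.0 : v[0];
}

int main(void)
{
  std::vector<double> v(1000, 0.1);
  std::printf("%.17g\n", fixed_tree_sum(v));
  return 0;
}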

In most cases I've encountered, the accuracy of the input data, often derived from sensors, does not warrant worrying about the 12th or later s.f. I don't know what your numbers represent, but for many scientists and engineers, equality to the 4th or 5th s.f. is enough for all practical purposes. It's a different matter for mathematicians ...

掀纱窥君容 2025-02-03 20:24:22


As the other answer mentions, getting exactly the same results between serial and distributed execution is not guaranteed. One common technique with HPC/distributed workloads is to validate the solution. There are a number of techniques, from calculating percent error to more complex validation schemes, like the one used by HPL. Here is a simple C++ function that calculates relative error (multiply by 100 for percent error). As @HighPerformanceMark notes in his post, the analysis of this sort of numerical error is incredibly complex; this is a very simple method, and there is a lot of information available online about the topic.

#include <iostream>
#include <cmath>

double calc_error(double a, double x)
{
  // Relative error of the approximation x against the reference a.
  return std::abs(x - a) / std::abs(a);
}

int main(void)
{
  // The last five cells of the result matrix, from the question.
  double sans[] = {-250207683.634793, -1353198687.861288, 2816966067.598196, -144344843844.616425, 323890119928.788757};  // serial
  double pans[] = {-250207683.634692, -1353198687.861386, 2816966067.598891, -144344843844.617096, 323890119928.788757};  // distributed
  double err[5];
  std::cout << "Serial Answer,Distributed Answer, Error" << std::endl;
  for (int it = 0; it < 5; it++) {
    err[it] = calc_error(sans[it], pans[it]);
    std::cout << sans[it] << "," << pans[it] << "," << err[it] << "\n";
  }
  return 0;
}

Which produces this output:

Serial Answer,Distributed Answer, Error
-2.50208e+08,-2.50208e+08,4.03665e-13
-1.3532e+09,-1.3532e+09,7.24136e-14
2.81697e+09,2.81697e+09,2.46631e-13
-1.44345e+11,-1.44345e+11,4.65127e-15
3.2389e+11,3.2389e+11,0

As you can see, the error in every case is on the order of 10^-13 or less, and in one case zero. Depending on the problem you are trying to solve, error of this magnitude could be considered acceptable. Hopefully this helps to illustrate one way of validating a distributed solution against a serial one, or at least gives one way to show how far apart the parallel and serial algorithms are.

When validating answers for big problems and parallel algorithms, it can also be valuable to perform several runs of the parallel algorithm, saving the results of each run. You can then check whether the result and/or error varies from run to run or settles over time.

Showing that a parallel algorithm produces error within an acceptable threshold over 1000 runs (just an example; the more data the better for this sort of thing) for various problem sizes is one way to assess the validity of a result. A sketch of that bookkeeping follows.
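Something like this, where the function name and the tolerance are illustrative and would be chosen per problem:

#include <cmath>
#include <cstdio>
#include <vector>

// Check every saved distributed run against the serial baseline:
// return false if any cell's relative error exceeds the tolerance.
bool all_runs_within(const std::vector<std::vector<double>>& runs,
                     const std::vector<double>& serial, double tol)
{
  for (const auto& run : runs)
    for (std::size_t i = 0; i < serial.size(); ++i)
      if (std::abs(run[i] - serial[i]) / std::abs(serial[i]) > tol)
        return false;
  return true;
}

int main(void)
{
  // Two cells from the question's data, and one saved distributed run.
  std::vector<double> serial = {-250207683.634793, -1353198687.861288};
  std::vector<std::vector<double>> runs = {
      {-250207683.634692, -1353198687.861386}};
  std::printf("within 1e-10: %d\n", all_runs_within(runs, serial, 1e-10));
  return 0;
}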

In the past, when I performed benchmark testing, I noticed wildly varying behavior for the first several runs, before the servers had "warmed up". At the time I never bothered to check whether the error in the result stabilized over time the same way performance did, but it would be interesting to see.
