Bus error in MPI_Finalize

I'm writing an MPI program for a parallel computing class. I've got the code working, and it outputs the correct result, but when I attempt to call MPI_Finalize with more than one process, I get a bus error. I'm running this on OS X through the PTP environment in Eclipse. The error is as follows:

[Fruity:49034] *** Process received signal ***
[Fruity:49034] Signal: Bus error (10)
[Fruity:49034] Signal code:  (2)
[Fruity:49034] Failing at address: 0x100336d7e
[Fruity:49034] [ 0] 2   libSystem.B.dylib                   0x00007fff865cc1ba _sigtramp + 26
[Fruity:49034] [ 1] 3   ???                                 0x0000000000000000 0x0 + 0
[Fruity:49034] [ 2] 4   libSystem.B.dylib                   0x00007fff86570c27 tiny_malloc_from_free_list + 1196
[Fruity:49034] [ 3] 5   libSystem.B.dylib                   0x00007fff8656fabd szone_malloc_should_clear + 242
[Fruity:49034] [ 4] 6   libopen-pal.0.dylib                 0x0000000100187b9f opal_memory_base_open + 527
[Fruity:49034] [ 5] 7   libSystem.B.dylib                   0x00007fff8656f98a malloc_zone_malloc + 82
[Fruity:49034] [ 6] 8   libSystem.B.dylib                   0x00007fff8656dc88 malloc + 44
[Fruity:49034] [ 7] 9   libSystem.B.dylib                   0x00007fff8657846d asprintf + 157
[Fruity:49034] [ 8] 10  libopen-rte.0.dylib                 0x000000010013aebc orte_schema_base_get_job_segment_name + 108
[Fruity:49034] [ 9] 11  libopen-rte.0.dylib                 0x000000010013d899 orte_smr_base_set_proc_state + 57
[Fruity:49034] [10] 12  libmpi.0.dylib                      0x0000000100063758 ompi_mpi_finalize + 312
[Fruity:49034] [11] 13  Assignment31                        0x0000000100002642 main + 491
[Fruity:49034] [12] 14  Assignment31                        0x0000000100001688 start + 52
[Fruity:49034] *** End of error message ***
mpirun noticed that job rank 0 with PID 49033 on node Fruity.local exited on signal 15 (Terminated).
1 additional process aborted (not shown)

Here's the main function of my code. I'm sure there are some bad C++ practices in here (I haven't used the language in years, and I'm self-taught), but it does output the correct values. If I need to post the rest of the file, I can do that; I just didn't want to make this a huge question if there's something obviously wrong.

int main(int argc, char* argv[]){
    /* start up MPI */
    MPI_Init(&argc, &argv);

    /* find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    /* find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);


    /* find which nodes this processor is responsible for */
    findStartAndEndPositions();

    /*Intitialize the array to its starting values. */
    initializeArray();

    /*Find the elements that are dependent on outside processors */
    findDependentElements();

    MPI_Barrier(MPI_COMM_WORLD);
    if(myRank == 0){
        startTime = MPI_Wtime();
        printArray();
    }

    int iter;
    for(iter = 0; iter < NUM_ITERATIONS; iter++){
        doCommunication();
        MPI_Barrier(MPI_COMM_WORLD);
        doIteration();
    }


    double check = computeCheck();
    double receive = 0;

    if(myRank == 0){
        MPI_Reduce(&check, &receive, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        std::cout << "The total time was: " << MPI_Wtime() - startTime << " \n";
        std::cout << "The checksum was: " << receive << " \n";
        printArray();
    }

    else{
        MPI_Reduce(&check, &receive, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    /* shut down MPI */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Edit: I've narrowed the problem down to somewhere in my doIteration function. I only get the error when that function is called, and only when I have more than one process running. Here's my doIteration function. It is supposed to replace each value of the matrix that isn't on an edge with the maximum of itself and its four neighbors. The values are only supposed to be updated once the entire pass has completed (hence the temporary array temp). A standalone sketch of the intended update follows the function below.

void doIteration(){
    int pos;
    double* temp = new double[end - start + 1];
    for(pos = start; pos <= end; pos++){
        int i, row, col;
        double max;

        convertToRowCol(pos, &row, &col);

        if(isEdgeNode(row, col))
            continue;

        int dependents[4];
        getDependentsOfPosition(pos, dependents);
        max = a[row][col];

        for(i = 0; i < 4; i++){
            if(isInvalidPos(dependents[i]))
                continue;

            int dRow, dCol;
            convertToRowCol(dependents[i], &dRow, &dCol);
            max = std::max(max, a[dRow][dCol]);
        }

        temp[pos] = max;
    }

    for(pos = start; pos <= end; pos++){
        int row, col;
        convertToRowCol(pos, &row, &col);
        if(! isEdgeNode(row, col))
            a[row][col] = temp[pos];
    }

    delete [] temp;
}
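
For clarity, here is a minimal standalone sketch of the update the function is meant to perform, written against a plain N x N matrix with whole-matrix indexing instead of the pos/start/end decomposition above. The names neighborMaxStep, grid, and n are purely illustrative and are not part of my actual code.

#include <algorithm>
#include <vector>

/* Illustrative only: replaces every interior cell of an N x N matrix with the
   maximum of itself and its four neighbors, writing into a copy so that all
   reads during the pass still see the old values (double buffering). */
void neighborMaxStep(std::vector<std::vector<double> >& grid){
    const std::size_t n = grid.size();
    std::vector<std::vector<double> > next = grid;  /* the copy plays the role of temp */

    for(std::size_t row = 1; row + 1 < n; row++){
        for(std::size_t col = 1; col + 1 < n; col++){
            double m = grid[row][col];
            m = std::max(m, grid[row - 1][col]);
            m = std::max(m, grid[row + 1][col]);
            m = std::max(m, grid[row][col - 1]);
            m = std::max(m, grid[row][col + 1]);
            next[row][col] = m;
        }
    }

    grid.swap(next);  /* publish the new values only after the whole pass */
}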


邮友 2024-12-16 07:38:03

I'm not sure whether this is the cause, but MPI_Reduce is a collective call that every rank makes with the same arguments, so there's no need to write it in both branches; a single call outside the if is enough. Try this and see if it helps.

MPI_Reduce(&check, &receive, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if(myRank == 0){
     std::cout << "The total time was: " << MPI_Wtime() - startTime << " \n";
     std::cout << "The checksum was: " << receive << " \n";
     printArray();
}