Memory access after fork() is extremely slow on Mac OS X

Posted 2024-10-07 11:57:11

The following code executes about 200 times slower on Mac OS X than on Linux. I don't know why, and the problem does not seem to be trivial. I suspect a bug in gcc on the Mac, in Mac OS X itself, or in my hardware.

The code forks the process, which on Mac OS X copies the page table entries but not the memory. The memory is copied when it is written to, which happens in the for loop at the end of the run method. There, for the first 4 calls of run, all pages have to be copied, because every page is touched. For the second 4 calls of run, where skip is 513, only every second page needs to be copied, since every second page is touched. Intuitively, the first 4 calls should take about twice as long as the second 4 calls, which is absolutely not the case. For me, the output of the program is as follows:

169.655ms
670.559ms
2784.18ms
16007.1ms
16.207ms
25.018ms
42.712ms
79.676ms

On Linux it is

5.306ms
10.69ms
20.91ms
41.042ms
6.115ms
12.203ms
23.939ms
40.663ms

Total runtime on Mac OS X is roughly 20 seconds, and about 0.5 seconds on Linux, for the exact same program, compiled both times with gcc. I've tried compiling the Mac OS version with gcc 4, 4.2 and 4.4; no change.
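
As a sanity check of the stride arithmetic above, here is a standalone sketch (illustrative only, not part of the benchmark; it assumes 4 KiB pages and the 8-byte uint64_t elements used in the code below) that counts the distinct pages each stride touches:

#include <cstdint>
#include <cstdio>
#include <initializer_list>
#include <set>

// Count the distinct 4 KiB pages written by a strided pass over
// `size` uint64_t elements; a skip of 512 elements is exactly one page.
int main() {
    const uint64_t pageBytes = 4096;
    const uint64_t size = 8ull * 1000 * 1000;
    for (uint64_t skip : {512ull, 513ull}) {
        std::set<uint64_t> pages;
        for (uint64_t i = 0; i < size; i += skip)
            pages.insert(i * sizeof(uint64_t) / pageBytes);
        std::printf("skip=%llu touches %zu distinct pages\n",
                    (unsigned long long)skip, pages.size());
    }
}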

Any ideas?

Code:

#include <stdint.h>
#include <iostream>
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>
#include <cstring>
#include <cstdlib>
#include <sys/time.h>

using namespace std;

class Timestamp
{
   private:
   timeval time;

   public:
   Timestamp() { gettimeofday(&time,0); }

   double operator-(const Timestamp& other) const { return static_cast<double>((static_cast<long long>(time.tv_sec)*1000000+(time.tv_usec))-(static_cast<long long>(other.time.tv_sec)*1000000+(other.time.tv_usec)))/1000.0; }
};

class ForkCoW
{
public:
   void run(uint64_t size, uint64_t skip) {
      // allocate and initialize array
      void* arrayVoid;
      posix_memalign(&arrayVoid, 4096, sizeof(uint64_t)*size);
      uint64_t* array = static_cast<uint64_t*>(arrayVoid);
      for (uint64_t i = 0; i < size; ++i)
         array[i] = 0;

      pid_t p = fork();
      if (p == 0)
         sleep(99999999);

      if (p < 0) {
         cerr << "ERRROR: Fork failed." << endl;
         exit(-1);
      }

      {
         Timestamp start;
         for (uint64_t i = 0; i < size; i += skip) {
            array[i] = 1;
         }
         Timestamp stop;
         cout << (stop-start) << "ms" << endl;
      }
      kill(p,SIGTERM);
   }
};

int main(int argc, char* argv[]) {
   ForkCoW f;
   f.run(1ull*1000*1000, 512);
   f.run(2ull*1000*1000, 512);
   f.run(4ull*1000*1000, 512);
   f.run(8ull*1000*1000, 512);

   f.run(1ull*1000*1000, 513);
   f.run(2ull*1000*1000, 513);
   f.run(4ull*1000*1000, 513);
   f.run(8ull*1000*1000, 513);
}

Comments (6)

还在原地等你 2024-10-14 11:57:11

The only reason for such a long sleep would be this line:

sleep(300000);

which results in 300000 seconds of sleep (the argument of sleep(3) is in seconds, not milliseconds). Maybe the implementation of fork() on Mac OS X is different from what you expect (and it always returns 0).

梦里人 2024-10-14 11:57:11

This has nothing to do with C++. I rewrote your example in C, using waitpid(2) instead of sleep/SIGCHLD, and cannot reproduce the problem:

#include <errno.h>
#include <inttypes.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>   /* needed for waitpid(2) */

void ForkCoWRun(uint64_t size, uint64_t skip) {
      // allocate and initialize array
      uint64_t* array;
      posix_memalign((void **)&array, 4096, sizeof(uint64_t)*size);
      for (uint64_t i = 0; i < size; ++i)
             array[i] = 0;

      pid_t p = fork();
      switch(p) {
          case -1:
              fprintf(stderr, "ERROR: Fork failed: %s\n", strerror(errno));
              exit(EXIT_FAILURE);
          case 0:
          {
              struct timeval start, stop;
              gettimeofday(&start, 0);
              for (uint64_t i = 0; i < size; i += skip) {
                 array[i] = 1;
              }
              gettimeofday(&stop, 0);

              long microsecs = (long)(stop.tv_sec - start.tv_sec) *1000000 + (long)(stop.tv_usec - start.tv_usec);
              printf("%ld.%03ld ms\n", microsecs / 1000, microsecs % 1000);
              exit(EXIT_SUCCESS);
          }
          default:
          {
              int exit_status;
              waitpid(p, &exit_status, 0);
              break;
          }
    }
    free(array);   /* release the parent's copy of the buffer */
}

int main(int argc, char* argv[]) {
    ForkCoWRun(1ull*1000*1000, 512);
    ForkCoWRun(2ull*1000*1000, 512);
    ForkCoWRun(4ull*1000*1000, 512);
    ForkCoWRun(8ull*1000*1000, 512);

    ForkCoWRun(1ull*1000*1000, 513);
    ForkCoWRun(2ull*1000*1000, 513);
    ForkCoWRun(4ull*1000*1000, 513);
    ForkCoWRun(8ull*1000*1000, 513);
}

and on OS X 10.8, 10.9, and 10.10, I get results like:

6.163 ms
12.239 ms
24.529 ms
49.223 ms
6.027 ms
12.081 ms
24.270 ms
49.498 ms

╄→承喏 2024-10-14 11:57:11

You are allocating 400 megabytes once, and then once again from the fork() (since the process is duplicated, including its memory allocation).

The reason for the slowness could simply be that with two processes after the fork() you run out of available physical memory and start using swap space on disk.

This is usually much slower than using the physical memory.

Edit following the comments

I suggest you change the code to start the timing measurement after writing to the first element of the array.

  array[0] = 1;   // first write happens before the clock starts
  Timestamp start;
  for (uint64_t i = 1; i < size; i += skip) {
     array[i] = 1;
  }

This way, the time used by the memory allocation triggered by the first write will not be taken into account in the measurement.

挽梦忆笙歌 2024-10-14 11:57:11

I suspect your problem is the order of execution: on Linux, the parent runs first, executes its loop, and the child terminates because its parent is gone; but on Mac OS the child runs first, which involves a 300-second sleep.

There is absolutely no guarantee in any Unix standard that the two processes will run in parallel after a fork, your assertions about the capability of the OS to do so notwithstanding.

Just to prove it's the sleep time, I replaced the "30000" in your code with "SLEEPTIME" and compiled and ran it with g++ -DSLEEPTIME=?? foo.c && ./a.out (a sketch of the change follows the table):

SLEEPTIME   output
20          20442.1
30          30468.5
40          40431.4
10          10449  <just to prove it wasn't getting longer each run>
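
For reference, a minimal sketch of the experiment (my reconstruction from the description above; where exactly SLEEPTIME replaces the literal in the original file is an assumption):

#include <sys/types.h>
#include <unistd.h>
#include <signal.h>

#ifndef SLEEPTIME
#define SLEEPTIME 30   // overridden at compile time, e.g. g++ -DSLEEPTIME=20 foo.c
#endif

int main() {
    pid_t p = fork();
    if (p == 0) {
        sleep(SLEEPTIME);   // the child's sleep now scales with the macro
        _exit(0);
    }
    // ... the timed write loop would run here in the parent ...
    kill(p, SIGTERM);       // clean up the child, as the original code does
    return 0;
}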

那请放手 2024-10-14 11:57:11

What happens when you have the parent waitpid() on the child and ensure that it has exited (and, to be safe, handle SIGCHLD to make sure the process is reaped)? It seems possible that on Linux the child exits sooner, and the page fault handler then has less copy-on-write work to do, since the pages are referenced by only a single process.
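
A minimal sketch of that suggestion (mine, not the poster's code; it assumes the timed loop can simply run after the child has been reaped):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t p = fork();
    if (p == 0)
        _exit(0);           // child exits immediately instead of sleeping

    int status;
    waitpid(p, &status, 0); // parent blocks until the child is reaped

    // With the child gone, the pages are again referenced by a single
    // process, so any remaining copy-on-write handling should be cheap.
    // ... the timed write loop would go here ...
    return 0;
}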

Second... do you have any idea what kind of work fork() has to do? In particular, it should not be assumed to be "fast". Semantically, it has to duplicate every page in the process's address space; historically, that is exactly what old Unix did, so they say. This is improved by initially marking those pages "copy-on-write" (that is, the pages are marked read-only and the kernel's page fault handler duplicates them at the first write), but it is still a lot of work, and it means your first write access to every page will be slow.
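
To put the question's numbers in perspective, a back-of-the-envelope calculation (using the figures from the question: the largest run writes 8,000,000 uint64_t elements = 64 MB = 15,625 pages of 4 KiB, each faulted once when skip is 512):

#include <cstdio>

// Implied cost per copy-on-write fault, from the timings in the question.
int main() {
    const double pages = 8e6 * 8 / 4096;                          // 15625 pages
    std::printf("OS X:  %.1f us per fault\n", 16007.1e3 / pages); // ~1024 us
    std::printf("Linux: %.1f us per fault\n", 41.042e3 / pages);  // ~2.6 us
}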

I congratulate the Linux developers on getting their fork() and copy-on-write implementation very fast for your access pattern... but it seems a very strange thing to claim it is a huge problem if Mac OS's kernel is not as good, or if other parts of the system happen to generate different access patterns, or whatever. Fork, and writing to pages after a fork, is not supposed to be fast.

I suppose what I am trying to say is that if you move your code to a kernel with a different set of design choices and all of a sudden your fork()s are slower, tough: that is part of moving your code to a different OS.

素罗衫 2024-10-14 11:57:11

Have you verified that fork() is working:

#include <iostream>
#include <sys/types.h>
#include <unistd.h>

int main()
{
    pid_t pid = fork();

    if( pid > 0 ) {
        std::cout << "Parent\n";
    } else if( pid == 0 ) {
        std::cout << "Child\n";
    } else {
        std::cout << "Failed to fork!\n";
    }
}
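
If fork() is working, you should see both "Parent" and "Child" printed, in either order, since the two lines come from different processes.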

Maybe Mac OS X has some restriction on forking child processes.
