Writing binary performance: numpy.ndarray.tofile vs numpy.ndarray.tobytes vs C++ file write

Posted 2025-01-12 01:30:45


I'm trying to write some large arrays to disk. I've tested three options, two of them in Python:

    import timeit
    import numpy as np

    # N=800 generates files about 4GB
    N=800
    compute_start=timeit.default_timer()

    vals = np.sqrt((np.arange(N)**2)[:,None,None]+(np.arange(N)**2)[None,:,None]+(np.arange(N)**2)[None,None,:])

    compute_end=timeit.default_timer()
    print("Compute time: ",compute_end-compute_start)

    tofile_start=timeit.default_timer()
    for i in range(2):
       f = open("out.bin", "wb")
       vals.tofile(f)
       f.close()                                                                                                                                                                                                       
    tofile_end=timeit.default_timer()
    print("tofile time: ",tofile_end-tofile_start)

    tobytes_start=timeit.default_timer()
    for i in range(2):
       f = open("out.bin", "wb")
       f.write(vals.tobytes())
       f.close()
    tobytes_end=timeit.default_timer()
    print("tobytes time: ",tobytes_end-tobytes_start)

And one in C++ (compiled with g++ -O3):

    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main(){
       std::vector<double> q(800*800*800, 3.14);

       auto dump_start = std::chrono::steady_clock::now();

       for (int i=0; i<2; i++) {
          std::ofstream outfile("out.bin", std::ios::out | std::ios::binary);
          outfile.write(reinterpret_cast<const char*>(&q[0]), q.size()*sizeof(double));
          outfile.close();
       }

       auto dump_end = std::chrono::steady_clock::now();

       std::printf("Dump time: %12.3f\n",
                   std::chrono::duration_cast<std::chrono::microseconds>(dump_end - dump_start).count()/1000000.0);

       return 0;
    }

Times reported are 16 seconds for tofile, 39 seconds for tobytes, and 34 seconds for the C++ write. Any ideas on why they should be so different? Especially the two NumPy cases; the docs say that numpy.ndarray.tofile() is equivalent to file.write(numpy.ndarray.tobytes()).
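For what it's worth, one difference I can see is that vals.tobytes() materializes a complete ~4 GB copy of the array in memory before any byte reaches the file, while tofile() can stream straight from the array's existing buffer. A copy-free variant can be written with memoryview, which hands f.write() the buffer directly; a minimal sketch, assuming the same C-contiguous vals (rebuilt here so the snippet stands alone):

    import numpy as np

    # rebuild vals exactly as above (N=800, ~4 GB of float64)
    N = 800
    sq = np.arange(N)**2
    vals = np.sqrt(sq[:, None, None] + sq[None, :, None] + sq[None, None, :])

    with open("out.bin", "wb") as f:
        # memoryview exposes the array's C-contiguous buffer directly,
        # avoiding the full intermediate bytes object that tobytes() creates
        f.write(memoryview(vals))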

Thank you~


Comments (1)

北斗星光 2025-01-19 01:30:45


I've also been bothered lately by the speed of numpy.ndarray.tofile() when writing large data sets (16 GB) to raw binary files, and here's what helped in my case (running on Windows 10), though I don't understand why:

  • use numpy.ndarray.flatten().tofile(), assuming you're not bothered by the structure when writing binary files
  • instead of overwriting an existing file, delete it first and then write the file

So with your variable names, using this code in my case:

    import os

    # vals is the large array from the question
    # delete the existing file first instead of overwriting it in place
    if os.path.exists("out.bin"):
       os.remove("out.bin")
    # sep='' means raw binary output; tofile ignores the format argument then
    vals.flatten().tofile("out.bin", sep='')

the writing speed increased from around 60 MB/s to almost 200 MB/s (still below the limit of the SSD). However, the writing speed isn't constant and sometimes drops. I hope it might still be helpful.
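If you want to check whether the delete-first step is what makes the difference on your machine, a small throughput comparison along these lines should do (just a sketch; timed_write is a helper I'm making up here, and vals is rebuilt as in the question):

    import os
    import timeit
    import numpy as np

    def timed_write(path, arr, delete_first):
        # hypothetical helper: write arr with tofile() and return MB/s
        if delete_first and os.path.exists(path):
            os.remove(path)
        start = timeit.default_timer()
        arr.tofile(path)
        elapsed = timeit.default_timer() - start
        return arr.nbytes / elapsed / 1e6

    N = 800
    sq = np.arange(N)**2
    vals = np.sqrt(sq[:, None, None] + sq[None, :, None] + sq[None, None, :])

    for delete_first in (False, True):
        print(f"delete_first={delete_first}: "
              f"{timed_write('out.bin', vals, delete_first):.0f} MB/s")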
