I/O Bound problems

Published on 2025-02-25 23:43:59

Sometimes the issue is that you need to load or save massive amounts of data, and the transfer to and from the hard disk is the bottleneck. Possible solutions include 1) using binary rather than text data, 2) using data compression, and 3) using specialized storage formats such as HDF5.
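As a rough illustration of the first two options, the sketch below writes the same integers as tab-separated text, as raw binary, and as gzip-compressed binary, then prints the file sizes. It uses only the standard library; the helper names (`write_text`, `write_binary`, `write_gzip`, `read_binary`) and the file names are made up for this example, and the sizes you see will depend on the data.

```python
import array
import gzip
import os
import tempfile

def write_text(path, xs):
    """Store integers as tab-separated decimal text."""
    with open(path, "w") as f:
        f.write("\t".join(map(str, xs)))

def write_binary(path, xs):
    """Store integers as raw fixed-width binary values (C int)."""
    with open(path, "wb") as f:
        array.array("i", xs).tofile(f)

def write_gzip(path, xs):
    """Store the same binary payload, gzip-compressed."""
    with gzip.open(path, "wb") as f:
        f.write(array.array("i", xs).tobytes())

def read_binary(path, count):
    """Read `count` C ints back from a raw binary file."""
    values = array.array("i")
    with open(path, "rb") as f:
        values.fromfile(f, count)
    return values.tolist()

xs = list(range(100000))
tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, "xs.txt")
bin_path = os.path.join(tmpdir, "xs.bin")
gz_path = os.path.join(tmpdir, "xs.bin.gz")

write_text(text_path, xs)
write_binary(bin_path, xs)
write_gzip(gz_path, xs)

for path in (text_path, bin_path, gz_path):
    print(os.path.basename(path), os.path.getsize(path), "bytes")

# The binary file needs no string parsing when it is read back.
restored = read_binary(bin_path, len(xs))
```

Binary storage mainly saves the cost of converting numbers to and from strings; compression trades some CPU time for fewer bytes transferred.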

If you are working with huge amounts of data, consider the use of 1) relational databases if there are many relations to manage, 2) HDF5 if a hierarchical structure is natural, and 3) NoSQL databases such as Redis if the data relations are simple and you need to transfer over the network.
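To make the relational option concrete, here is a minimal sketch using the standard-library `sqlite3` module with an in-memory database. The `subjects` and `measurements` tables and their contents are invented for illustration; the point is only that the database manages the relation between the two tables and answers a join query.

```python
import sqlite3

# In-memory database; pass a file name instead to persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE measurements ("
    "subject_id INTEGER REFERENCES subjects(id), value REAL)"
)

# Bulk inserts with executemany avoid one Python-level call per row.
conn.executemany("INSERT INTO subjects VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [(1, 1.5), (1, 2.5), (2, 4.0)])

# Join the two tables and aggregate per subject.
rows = conn.execute(
    "SELECT s.name, COUNT(*), AVG(m.value) "
    "FROM measurements AS m JOIN subjects AS s ON s.id = m.subject_id "
    "GROUP BY s.name ORDER BY s.name"
).fetchall()
print(rows)
conn.close()
```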

Pandas also offers convenient access to multiple storage and retrieval options via its DataFrame object.
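A small sketch of what this looks like in practice, assuming pandas is installed: the same DataFrame is written as CSV (text) and as a pickle (binary) and read back. The column names and file names are arbitrary; other writers such as `to_hdf` follow the same pattern but need extra dependencies, so they are left out here.

```python
import os
import tempfile

import pandas as pd

# A toy DataFrame; the columns are arbitrary illustration data.
df = pd.DataFrame({"x": list(range(5)), "y": [v * 0.5 for v in range(5)]})

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "df.csv")
pkl_path = os.path.join(tmpdir, "df.pkl")

# Text format: portable and human-readable, but must be parsed on load.
df.to_csv(csv_path, index=False)
# Binary format: Python-specific, preserves dtypes exactly.
df.to_pickle(pkl_path)

from_csv = pd.read_csv(csv_path)
from_pkl = pd.read_pickle(pkl_path)
print(from_csv)
```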

import numpy as np

def io1(xs):
    """Using loops to write."""
    with open('foo1.txt', 'w') as f:
        for x in xs:
            f.write('%d\t' % x)

def io2(xs):
    """Join before writing."""
    with open('foo2.txt', 'w') as f:
        f.write('\t'.join(map(str, xs)))

def io3(xs):
    """Numpy savetxt is surprisingly slow."""
    np.savetxt('foo3.txt', xs, delimiter='\t')

def io4(xs):
    """NumPy save is better if a binary format is OK."""
    np.save('foo4.npy', xs)

def io5(xs):
    """Using HDF5."""
    import h5py
    with h5py.File("mytestfile1.h5", "w") as f:
        ds = f.create_dataset("xs", (len(xs),), dtype='i')
        ds[:] = xs

def io6(xs):
    """Using HDF5 with compression."""
    import h5py
    with h5py.File("mytestfile2.h5", "w") as f:
        ds = f.create_dataset("xs", (len(xs),), dtype='i', compression="lzf")
        ds[:] = xs

n = 1000*1000
xs = range(n)
%timeit -r1 -n1 io1(xs)
%timeit -r1 -n1 io2(xs)
%timeit -r1 -n1 io3(xs)
%timeit -r1 -n1 io4(xs)
%timeit -r1 -n1 io5(xs)
%timeit -r1 -n1 io6(xs)

1 loops, best of 1: 1.64 s per loop
1 loops, best of 1: 320 ms per loop
1 loops, best of 1: 6.7 s per loop
1 loops, best of 1: 108 ms per loop
1 loops, best of 1: 154 ms per loop
1 loops, best of 1: 122 ms per loop
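The `%timeit` lines above are IPython magics and will not run in a plain Python script. A rough equivalent using the standard-library `timeit` module might look like the sketch below. The helper `time_once` and the function names `write_loop` and `write_join` are illustrative stand-ins for the loop and join writers, and the numbers it prints will differ from machine to machine.

```python
import os
import tempfile
import timeit

def write_loop(path, xs):
    """Write one value per f.write() call."""
    with open(path, "w") as f:
        for x in xs:
            f.write("%d\t" % x)

def write_join(path, xs):
    """Build the whole string first, then write once."""
    with open(path, "w") as f:
        f.write("\t".join(map(str, xs)))

def time_once(func, *args):
    """Return wall-clock seconds for a single call (like %timeit -r1 -n1)."""
    return timeit.timeit(lambda: func(*args), number=1)

tmpdir = tempfile.mkdtemp()
xs = range(100000)
timings = {}
for func in (write_loop, write_join):
    path = os.path.join(tmpdir, func.__name__ + ".txt")
    timings[func.__name__] = time_once(func, path, xs)
    print("%s: %.1f ms" % (func.__name__, timings[func.__name__] * 1e3))
```

A single run is noisy; raising `number` or repeating the measurement gives steadier figures at the cost of rewriting the file each time.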

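import pandas as pd

def io11(xs):
    """Using basic python."""
    with open('foo1.txt', 'r') as f:
        # list() forces the conversion; in Python 3, map() alone is lazy
        xs = list(map(int, f.read().strip().split('\t')))
    return xs

def io12(xs):
    """Using pandas."""
    # foo2.txt is a single tab-separated line with no header row
    xs = pd.read_table('foo2.txt', header=None).values.ravel().tolist()
    return xs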

def io13(xs):
    """NumPy loadtxt."""
    xs = np.loadtxt('foo3.txt', delimiter='\t')
    return xs

def io14(xs):
    """Numpy load."""
    xs = np.load('foo4.npy')
    return xs

def io15(xs):
    """Using HDF5."""
    import h5py
    with h5py.File("mytestfile1.h5", 'r') as f:
        xs = f['xs'][:]
    return xs

def io16(xs):
    """Using HDF5 with compression."""
    import h5py
    with h5py.File("mytestfile2.h5", 'r') as f:
        xs = f['xs'][:]
    return xs

n = 1000*1000
xs = range(n)
%timeit -r1 -n1 io11(xs)
%timeit -r1 -n1 io12(xs)
%timeit -r1 -n1 io13(xs)
%timeit -r1 -n1 io14(xs)
%timeit -r1 -n1 io15(xs)
%timeit -r1 -n1 io16(xs)

1 loops, best of 1: 805 ms per loop
1 loops, best of 1: 51.3 s per loop
1 loops, best of 1: 5.56 s per loop
1 loops, best of 1: 15.2 ms per loop
1 loops, best of 1: 9.69 ms per loop
1 loops, best of 1: 16 ms per loop
