How to avoid physical disk I/O
I have a process which writes a huge amount of data over the network. Let's say it runs on machine A and dumps a file of around 70-80 GB onto machine B over NFS. After process 1 finishes and exits, my process 2 runs on machine A and fetches this file from machine B over NFS. The bottleneck in the entire cycle is the writing and reading of this huge data file. How can I reduce this I/O time? Can I somehow keep the data loaded in memory, ready for use by process 2 even after process 1 has exited?
I'd appreciate ideas on this. Thanks.
Edit: since process 2 'reads' the data directly from the network, would it be better to copy the data locally first and then read from the local disk? I mean, would
(read time over network) > (cp to local disk) + (read from local disk)
Comments (6)
If you want to keep the data loaded in memory, then you'll need 70-80 GB of RAM.
The best option is probably to attach local storage (a hard disk drive) to machine A and keep the file there.
The obvious answer is to reduce network writes, which could save you a great deal of time and improve reliability. There seems to be very little point in copying a file to another machine only to copy it back, so to answer your question more precisely we will need more information.
There is a lot of network and I/O overhead with this approach, so you may not be able to reduce the latency much further.
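To put rough numbers on that overhead, here is a back-of-envelope sketch. The throughput figures (a 1 Gbit/s link at about 125 MB/s, a local disk at 500 MB/s) are assumptions chosen for illustration, not measurements from the asker's setup, and the helper names are hypothetical.

```python
GB = 10**9  # decimal gigabyte, in bytes
MB = 10**6  # decimal megabyte, in bytes


def transfer_seconds(size_bytes, throughput_bytes_per_s):
    """Time to move size_bytes at a sustained throughput."""
    return size_bytes / throughput_bytes_per_s


def direct_read_seconds(size_bytes, net_bps):
    """Process 2 reads the file once, straight over the network."""
    return transfer_seconds(size_bytes, net_bps)


def copy_then_read_seconds(size_bytes, net_bps, disk_write_bps, disk_read_bps):
    """cp to local disk first, then read the local copy once."""
    # The copy is limited by the slower of the network and the local disk write.
    copy = transfer_seconds(size_bytes, min(net_bps, disk_write_bps))
    return copy + transfer_seconds(size_bytes, disk_read_bps)


size = 80 * GB
net = 125 * MB   # assumed: 1 Gbit/s link, best case
disk = 500 * MB  # assumed: local disk throughput

print(direct_read_seconds(size, net))                 # 640.0
print(copy_then_read_seconds(size, net, disk, disk))  # 800.0
```

Under this simple model, copying first cannot beat a single direct pass, because cp itself has to pull every byte over the network. A local copy may still pay off if the file is read more than once.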
Whether you use mmap or plain read/write should make little difference; either way, everything happens through the filesystem cache/buffers. The big problem is NFS. The only way you can make this efficient is by storing the intermediate data locally on machine A rather than sending it all over the network to machine B only to pull it back again right afterwards.
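As an illustration of the two access paths mentioned above, here is a minimal sketch that reads the same file with plain read() calls and through a memory mapping. The function names and the demo file are hypothetical; the point is only that both paths return the same bytes from the same underlying file.

```python
import mmap
import os
import tempfile


def read_with_read(path, chunk_size=1 << 20):
    """Read the whole file with plain read() calls, chunk by chunk."""
    parts = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(chunk)
    return b"".join(parts)


def read_with_mmap(path):
    """Read the whole file through a read-only memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]


def make_demo_file(size=4 * 1024 * 1024):
    """Create a temporary file of random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path
```

Neither variant changes where the data lives: if the file sits on NFS, both still pay the network cost.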
Use tmpfs to leverage memory as (temporary) files.
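A minimal sketch of the tmpfs idea follows. It assumes /dev/shm is a tmpfs mount, which is the case on most Linux systems but is not guaranteed, and falls back to the ordinary temp directory (on disk) otherwise. The helper names are hypothetical. A child process stands in for process 1, and the calling process stands in for process 2.

```python
import os
import subprocess
import sys
import tempfile


def pick_memory_dir():
    """Return a tmpfs-backed directory if one is available.

    Assumption: /dev/shm is tmpfs, as on most Linux systems. Otherwise fall
    back to the normal temp directory, which is usually backed by disk.
    """
    shm = "/dev/shm"
    if os.path.isdir(shm) and os.access(shm, os.W_OK):
        return shm
    return tempfile.gettempdir()


# One-line writer program run in a separate interpreter.
WRITER = "import sys; open(sys.argv[1], 'wb').write(b'x' * int(sys.argv[2]))"


def write_in_child_process(path, size):
    """Stand-in for process 1: a separate process writes the file and exits."""
    subprocess.run([sys.executable, "-c", WRITER, path, str(size)], check=True)


def read_in_this_process(path):
    """Stand-in for process 2: read the file after the writer has exited."""
    with open(path, "rb") as f:
        return f.read()
```

The file outlives the writing process because it belongs to the filesystem, not to the process. Data in tmpfs counts against RAM and swap, so a 70-80 GB file needs that much memory available.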
Use mbuffer with netcat to simply relay from one port to another without storing the intermediate stream, while still allowing the two sides to stream at different speeds:
On machine2, configure a job like:
This will allow at most 2 GB of data to be buffered. If the buffer fills to 100%, machine2 will simply start blocking reads from machine1, delaying the output stream without failing.
When machine1 has completed transmission, the second netcat will stay around until the mbuffer is depleted.
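The blocking behavior described above can be sketched in-process with a bounded queue. This is not mbuffer or netcat, and it is not the command from the original answer. It is only an analogue of the same idea: a fixed-size buffer between a producer and a consumer, where a full buffer stalls the producer instead of failing. The relay function and its parameters are hypothetical.

```python
import queue
import threading


def relay(source, sink, max_chunks):
    """Relay chunks from source (an iterable of bytes) to sink (a callable)
    through a buffer holding at most max_chunks chunks.

    Returns the total number of bytes relayed.
    """
    buf = queue.Queue(maxsize=max_chunks)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for chunk in source:
            buf.put(chunk)  # blocks while the buffer is full
        buf.put(done)

    t = threading.Thread(target=producer, daemon=True)
    t.start()

    total = 0
    while True:
        item = buf.get()  # blocks while the buffer is empty
        if item is done:
            break
        sink(item)
        total += len(item)
    t.join()
    return total
```

As in the description, the consumer keeps draining after the producer has finished, and returns only once the buffer is empty.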