How to avoid physical disk I/O

Posted 2024-11-30 17:15:53

I have a process which writes a huge amount of data over the network. Let's say it runs on machine A and dumps a file of around 70-80 GB onto machine B over NFS. After process 1 finishes and exits, my process 2 runs on machine A and fetches this file from machine B over NFS. The bottleneck in the entire cycle is the writing and reading of this huge data file. How can I reduce this I/O time? Can I somehow keep the data loaded in memory, ready to use by process 2 even after process 1 has exited?

I'd appreciate ideas on this. Thanks.

Edit: since process 2 'reads' the data directly from the network, would it be better to copy the data locally first and then read from the local disk? In other words, would

(read time over network) > (cp to local disk) + (read from local disk)?

Comments (6)

蓝戈者 2024-12-07 17:15:53

If you want to keep the data loaded in memory, then you'll need 70-80 GB of RAM.

The best option is probably to attach local storage (a hard disk drive) to system A and keep this file locally.

静待花开 2024-12-07 17:15:53

The obvious answer is to reduce the network writes - this could save you a great deal of time and improve reliability. There seems very little point in copying a file to another machine only to copy it back, so in order to answer your question more precisely we would need more information.

无戏配角 2024-12-07 17:15:53

There is a lot of network and I/O overhead with this approach, so you may not be able to reduce the latency much further.

  1. Since the file is around 80 GB, create a memory-mapped file that process 1 writes into and process 2 later reads from - no network involved, only machine A is used - but the disk I/O overhead is still unavoidable.
  2. Faster: both processes can run simultaneously, and you can use a semaphore or another signalling mechanism by which process 1 indicates to process 2 that the file is ready to be read.
  3. Fastest approach: let process 1 create a shared memory segment and share it with process 2. Whenever a limit is reached (the maximum data chunk that can be loaded into memory, based on your RAM size), let process 1 signal process 2 that the data can be read and processed - this solution is only feasible if the file/data can actually be processed chunk by chunk instead of as one big 80 GB chunk (see the sketch after this list).
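
For the concurrent, chunk-by-chunk idea in points 2 and 3, here is a rough shell sketch that uses a named pipe instead of explicit shared memory; process1, process2, and the pipe path are placeholders for your actual programs:

 # On machine A: create a named pipe so the intermediate data never hits disk or NFS
 mkfifo /tmp/p1_to_p2.fifo

 # Process 1 writes while process 2 reads concurrently; the kernel blocks the
 # writer whenever the reader falls behind, so memory use stays bounded
 ./process1 > /tmp/p1_to_p2.fifo &
 ./process2 < /tmp/p1_to_p2.fifo

 # Clean up the pipe once both sides are done
 rm /tmp/p1_to_p2.fifo
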
小镇女孩 2024-12-07 17:15:53

Whether you use mmap or plain read/write should make little difference; either way, everything happens through the filesystem cache/buffers. The big problem is NFS. The only way you can make this efficient is by storing the intermediate data locally on machine A rather than sending it all over the network to machine B only to pull it back again right afterwards.

人海汹涌 2024-12-07 17:15:53

Use tmpfs to leverage memory as (temporary) files.
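
A minimal sketch of that idea, assuming machine A has enough free RAM for the file; the mount point and the size cap are placeholder choices, and mounting requires root:

 # Create a RAM-backed filesystem on machine A; the 100g cap is an assumption
 mkdir -p /mnt/ramdisk
 mount -t tmpfs -o size=100g tmpfs /mnt/ramdisk

 # Process 1 writes the intermediate file there and process 2 reads it back;
 # the data stays in RAM and never touches a physical disk or the NFS mount
 ./process1 > /mnt/ramdisk/bigfile
 ./process2 < /mnt/ramdisk/bigfile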

Use mbuffer with netcat to simply relay from one port to another without storing the intermediate stream, but still allowing streaming to occur at varying speeds:

machine1:8001 -> machine2:8002 -> machine3:8003

On machine2, configure a job like:

 netcat -l -p 8002 | mbuffer -m 2G | netcat machine3 8003

This will allow at most 2 gigs of data to be buffered. If the buffer is filled 100%, machine2 will just start blocking reads from machine1, delaying the output stream without failing.

When machine1 has completed its transmission, the second netcat will stay around until the mbuffer is drained.

清引 2024-12-07 17:15:53
  1. You can use a RAM disk as storage.
  2. NFS is slow. Try an alternative way to transfer the data to the other machine, for example a plain TCP/IP stream (see the sketch after this list).
  3. Another solution - you can use an in-memory database (TimesTen, for example).
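
As a sketch of point 2, process 1 could stream its output straight over a TCP connection instead of writing to the NFS mount; the host name machineB, port 9000, and the destination path are assumptions:

 # On machine B: listen on a port and write the incoming stream to local storage
 netcat -l -p 9000 > /data/bigfile

 # On machine A: process 1 pipes its output directly over TCP, bypassing NFS entirely
 ./process1 | netcat machineB 9000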