Concurrently processing multiple files: copy the files, or read them over NFS?
I need to concurrently process a large amount of files (thousands of different files, with avg. size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be processed by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine, after reading a file from the 'incoming' folder on the 1.5TB hard drive, will process the information and then output the processed result back to the 'processed' folder on the 1.5TB drive. The processed information for every file is roughly the same average size as the input file (about 2MB per file).
What is the better thing to do:
(1) For every processing machine M, copy all the files that M will process onto its local hard drive, and then read & process the files locally on machine M.
(2) Instead of copying the files to every machine, every machine will access the 'incoming' folder directly (using NFS), and will read the files from there, and then process them locally.
Which idea is better? Are there any dos and don'ts when doing such a thing?
I am mostly curious whether it is a problem to have 30 or so machines reading (or writing) information from/to the same network drive at the same time.
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...). Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS on all machines, if that matters at all.)
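For concreteness, option (2) might look roughly like the per-machine loop below. This is only a sketch: the /mnt/share mount point, the WORKER_ID-based partitioning, and process_bytes() are illustrative placeholders, not part of my actual setup.

    #!/usr/bin/env python3
    # Sketch of option (2): every machine mounts the shared drive over NFS and
    # works directly on the 'incoming' folder. Paths, WORKER_ID, and
    # process_bytes() are hypothetical placeholders.
    import hashlib
    import os

    SHARE = "/mnt/share"                          # hypothetical NFS mount point
    INCOMING = os.path.join(SHARE, "incoming")
    PROCESSED = os.path.join(SHARE, "processed")
    NUM_WORKERS = 30                              # about 30 machines
    WORKER_ID = int(os.environ.get("WORKER_ID", "0"))   # 0..29, unique per machine

    def is_mine(name):
        # Partition files across machines by hashing the file name, so no two
        # machines ever read the same input file.
        digest = hashlib.md5(name.encode()).hexdigest()
        return int(digest, 16) % NUM_WORKERS == WORKER_ID

    def process_bytes(data):
        # Placeholder for the real ~2MB-in / ~2MB-out processing step.
        return data

    def main():
        for name in sorted(os.listdir(INCOMING)):
            if not is_mine(name):
                continue
            # One sequential read of the whole ~2MB file keeps the NFS traffic simple.
            with open(os.path.join(INCOMING, name), "rb") as f:
                data = f.read()
            result = process_bytes(data)
            # Output files are created from scratch, so there is no write contention;
            # write to a temporary name and rename so readers never see a partial file.
            out_path = os.path.join(PROCESSED, name)
            with open(out_path + ".tmp", "wb") as f:
                f.write(result)
            os.rename(out_path + ".tmp", out_path)

    if __name__ == "__main__":
        main()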
1 Answer
I would definitely do #2 - and I would do it as follows:
Run Apache on your main server with all the files. (Or some other HTTP server, if you really want.) There are several reasons I'd do it this way (a minimal client sketch follows the list below):
HTTP is basically pure TCP (with some headers on it). Once the request is sent, it's a very "one-way" protocol: low overhead, not chatty, high performance and efficiency.
If you (for whatever reason) decide you need to move or scale it out (using a cloud service, for example), HTTP would be a much better way than NFS to move the data around over the open Internet. You could use SSL (if needed). You could get through firewalls (if needed). Etc., etc., etc.
Depending on the access pattern of your files, and assuming the whole file needs to be read, it's easier/faster to do just one network operation and pull the whole file in in one whack, rather than constantly requesting I/Os over the network every time you read a smaller piece of the file.
It could be easy to distribute and run an application that does all this and doesn't rely on the existence of network mounts, specific file paths, etc. If you have the URL to the files, the client can do its job. It doesn't need established mounts or hard-coded directories, and it doesn't need to become root to set up such mounts.
If you have NFS connectivity problems, the whole system can get wacky when you try to access the mounts and they hang. With HTTP running in a user-space context, you just get a timeout error, and your application can take whatever action it chooses (like paging you, logging errors, etc.).
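For illustration, a client along these lines could be as small as the sketch below. The http://fileserver/incoming/ URL, the file names, and process_bytes() are placeholder assumptions, and getting the results back into the 'processed' folder is left open.

    #!/usr/bin/env python3
    # Sketch of the HTTP approach: the main server exposes the 'incoming' folder
    # through Apache, and each worker pulls whole files with a single GET.
    # The http://fileserver/incoming/ URL and process_bytes() are hypothetical
    # placeholders; shipping results back to 'processed' is not shown.
    import urllib.request

    BASE_URL = "http://fileserver/incoming/"    # hypothetical URL for the shared folder
    TIMEOUT_SECONDS = 30                        # a dead server becomes a timeout, not a hung mount

    def fetch(name):
        # Pull the whole ~2MB file in one request (one network round trip per file).
        with urllib.request.urlopen(BASE_URL + name, timeout=TIMEOUT_SECONDS) as resp:
            return resp.read()

    def process_bytes(data):
        # Placeholder for the real per-file processing.
        return data

    def main(names):
        for name in names:
            try:
                data = fetch(name)
            except OSError as exc:
                # In user space a failure is just an error to log and retry,
                # instead of a hung NFS mount wedging the whole machine.
                print("failed to fetch %s: %s" % (name, exc))
                continue
            result = process_bytes(data)
            with open(name + ".out", "wb") as f:    # write the result locally for now
                f.write(result)

    if __name__ == "__main__":
        main(["example-0001.dat"])                  # hypothetical file name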