Can parallel operations speed up processing a file from disk in R?
I have a huge data file (~4GB) that I am passing through R (to do some string cleanup) on its way into a MySQL database. Each row/line is independent of the others. Is there any speed advantage to be had by using parallel operations to finish this process? That is, could one thread start by skipping no lines and scan every second line, while another starts with a skip of one line and reads every second line? If so, would it actually speed up the process, or would the two threads fighting over the 10K Western Digital hard drive (not an SSD) negate any possible advantage?
5 Answers
The answer is maybe. At some point, disk access will become the limiting factor. Whether that happens with 2 cores running or with 8 depends on the characteristics of your hardware setup. It'd be pretty easy to just try it out while watching your system with top. If your %wa is consistently above zero, it means the CPUs are waiting for the disk to catch up, and you're likely slowing the whole process down.
Why not just use some of the standard Unix tools to split the file into chunks and run several R command-line processes in parallel, each working on its own chunk? No need to be fancy if simple will do.
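
As a rough sketch of that approach, assuming the file has been pre-split with the standard split(1) tool (e.g. split -l 1000000 bigfile.txt chunk_), each chunk could be handed to a small Rscript worker. The file names and the whitespace-collapsing rule below are placeholders for the real cleanup:

    #!/usr/bin/env Rscript
    # Hypothetical worker: clean one chunk produced by split(1),
    # invoked as e.g.  Rscript clean_chunk.R chunk_aa
    args <- commandArgs(trailingOnly = TRUE)
    lines <- readLines(args[1])
    # Stand-in for the real string cleanup: collapse runs of whitespace.
    cleaned <- gsub("[ \t]+", " ", trimws(lines))
    writeLines(cleaned, paste0(args[1], ".clean"))

Several of these can then be launched in the background from the shell (ending each command with &, then wait), giving process-level parallelism with no coordination needed inside R.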
The bottleneck will likely be the HDD. It doesn't matter how many processes are trying to access it; it can only read/write one thing at a time.
This assumes the "string cleanup" uses minimal CPU. awk or sed are generally better suited to this than R.
You probably want to read from the disk in one linear forward pass, since the OS and the disk optimize heavily for that case. But you could parcel out blocks of lines to worker threads/processes from wherever you're reading the disk. (If you can do process parallelism rather than thread parallelism, you probably should; it's way less hassle all around.)
Can you describe the string cleanup that's required? R is not the first tool I would reach for when it comes to string bashing.
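
A minimal sketch of that pattern in R, assuming a Unix-alike (mclapply forks, so it won't parallelize on Windows) and using a placeholder file name and cleanup function: one connection reads the file in a single linear forward pass, while each block of lines is parceled out to forked worker processes.

    library(parallel)

    clean <- function(x) gsub("[ \t]+", " ", trimws(x))  # stand-in cleanup

    con <- file("bigfile.txt", open = "r")
    ncores <- detectCores()
    repeat {
      block <- readLines(con, n = 400000)   # one linear forward reader
      if (length(block) == 0) break
      # Parcel the block of lines out to forked worker processes.
      parts <- splitIndices(length(block), ncores)
      cleaned <- unlist(mclapply(parts, function(i) clean(block[i]),
                                 mc.cores = ncores))
      # ... hand `cleaned` to the MySQL loader here ...
    }
    close(con)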
Ruby is another easy scripting language for file manipulation and cleanup. But it is still a question of the ratio of processing time to reading time. If the point is to do things like select columns or rearrange them, you are far better off going with ruby, awk, or sed; even for simple computations those would be better. But if for each line you are, say, fitting a regression model or performing a simulation, you would be better off doing the tasks in parallel. The question cannot have a definite answer because we don't know the exact parameters. But it sounds like for most simple cleanup jobs it would be better to use a language well suited to the task, like ruby, and run it in a single thread.
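
To illustrate, a single-threaded whitespace cleanup (again, a stand-in for whatever the real rule is) is a one-liner in ruby, processing the file line by line:

    ruby -pe '$_ = $_.gsub(/[ \t]+/, " ")' bigfile.txt > bigfile.clean.txt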