Data analysis with R/Python and SSDs
Does anyone have any experience using R/Python with data stored on Solid State Drives? If you are doing mostly reads, in theory this should significantly improve the load times of large datasets. I want to find out if this is true, and whether it is worth investing in SSDs to improve the IO rates in data-intensive applications.
Comments (5)
My 2 cents: an SSD only pays off if your applications are stored on it, not your data. And even then, only if a lot of disk access is necessary, as for an OS. People are right to point you to profiling. I can tell you without doing it that almost all of the reading time goes to processing, not to reading from the disk.
It pays off far more to think about the format of your data instead of where it's stored. A speedup in reading your data can be obtained by using the right applications and the right format. Like using R's internal format instead of fumbling around with text files. Make that an exclamation mark: never keep on fumbling around with text files. Go binary if speed is what you need.
Due to the overhead, it generally doesn't make a difference whether you have an SSD or a normal disk to read your data from. I have both, and use the normal disk for all my data. I do juggle big datasets sometimes, and have never experienced a problem with it. Of course, if I have to go really heavy, I just work on our servers.
So it might make a difference when we're talking gigs and gigs of data, but even then I doubt very much that disk access is the limiting factor. Unless you're continuously reading and writing to the disk, but then I'd say you should start thinking again about what exactly you're doing. Instead of spending that money on SSD drives, extra memory could be the better option. Or just convince the boss to get you a decent calculation server.
A timing experiment using a bogus data frame, reading and writing in text format vs. binary format on an SSD disk vs. a normal disk, makes the point.
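A minimal sketch of such an experiment in R (the data frame and file names are made up; run the same script with the working directory on each disk to compare):

    # Build a bogus data frame: two numeric columns plus one long string
    # that is identical on every row.
    n <- 1e6
    df <- data.frame(x = runif(n),
                     y = rnorm(n),
                     longtext = "the same long string repeated on every row")

    # Text format: formatting on write and parsing on read dominate.
    system.time(write.table(df, "df.txt", sep = ";", row.names = FALSE))
    system.time(read.table("df.txt", sep = ";", header = TRUE))

    # R's native binary format: close to a raw dump of the in-memory object.
    system.time(save(df, file = "df.RData"))
    system.time(load("df.RData"))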
http://www.codinghorror.com/blog/2010/09/revisiting-solid-state-hard-drives.html has a good article on SSDs, and the comments offer a lot of insights.
Depends on the type of analysis you're doing, whether it's CPU bound or IO bound.
Personal experience with regression modelling tells me the former is more often the case, and SSDs wouldn't be of much use then.
In short, best to profile your application first.
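For instance, a minimal sketch with R's built-in profiler (the data file name is a placeholder):

    # Profile the load step to see whether the time goes to disk I/O
    # or to parsing and object construction.
    Rprof("profile.out")
    df <- read.table("big_dataset.txt", sep = ";", header = TRUE)
    Rprof(NULL)
    summaryRprof("profile.out")$by.self  # the top rows are the real bottleneck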
Sorry, but I have to disagree with the top-rated answer by @joris. It's true that if you run that code, the binary version takes almost zero time to write. But that's because the test set is weird. The big column 'longtext' is the same for every row. Data frames in R are smart enough not to store duplicate values more than once (via factors).
So in the end we finish with a 700 MB text file versus a 335 KB binary file (of course the binary is much faster xD).
However, if we try with random data, where every row gets a different string (see the sketch below), the files are not that different in size.
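A hypothetical version of that test, with a different random string in every row so nothing can be deduplicated:

    # Same experiment, but 'longtext' is now unique per row, so neither
    # factor levels nor serialization can collapse the duplicates.
    n <- 1e5
    rand_string <- function(len) {
      paste(sample(c(letters, LETTERS, 0:9), len, replace = TRUE), collapse = "")
    }
    df <- data.frame(x = runif(n),
                     longtext = vapply(seq_len(n), function(i) rand_string(1000),
                                       character(1)),
                     stringsAsFactors = FALSE)

    system.time(write.table(df, "df.txt", sep = ";", row.names = FALSE))
    system.time(save(df, file = "df.RData"))
    # On the server described below, elapsed time exceeded user + system
    # in both cases: the gap is time spent waiting on the disk.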
As you can see, the elapsed time is not just the sum of user + system time: the extra wall-clock time is spent waiting on I/O, so the disk is the bottleneck in both cases. Yes, binary storage will always be faster, since you don't have to write out semicolons, quotes or stuff like that, but can just dump the memory object to disk.
BUT there is always a point where the disk becomes the bottleneck. My test was run on a research server where, via a NAS solution, we get disk read/write speeds of over 600 MB/s. If you do the same on your laptop, where it's hard to get over 50 MB/s, you'll notice the difference.
So, if you actually have to deal with real big data (and repeating the same thousand-character string a million times is not big data), then once the binary dump of the data is over 1 GB, you'll appreciate having a good disk (an SSD is a good choice) for reading input data and writing results back to disk.
I have to second John's suggestion to profile your application. My experience is that it isn't the actual data reads that are the slow part; it's the overhead of creating the programming objects to contain the data, casting from strings, memory allocation, etc.
I would strongly suggest you profile your code first, and consider using alternative libraries (like numpy) to see what improvements you can get before you invest in hardware.
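In R, for example, part of that string-casting overhead can be avoided by declaring the column types up front instead of letting the reader guess them (a sketch; the file and column types are assumptions):

    # Pre-declared column classes skip the type-guessing pass and the
    # per-column re-casting that read.table otherwise performs.
    df <- read.table("df.txt", sep = ";", header = TRUE,
                     colClasses = c("numeric", "numeric", "character"))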
The read and write speeds of SSDs are significantly higher than those of standard 7200 RPM disks (an SSD is still worth it over a 10k RPM disk; I'm not sure how much of an improvement it is over a 15k). So, yes, you'd get much faster times on data access.
The performance improvement is undeniable. Then it's a question of economics: 2 TB 7200 RPM disks are $170 a piece, while 100 GB SSDs cost $210. So if you have a lot of data, you may run into a problem.
If you read/write a lot of data, get an SSD. If the application is CPU intensive, however, you'd benefit much more from getting a better processor.