Data analysis with R/Python and SSDs
Does anyone have any experience using R/Python with data stored on Solid State Drives? If you are doing mostly reads, in theory this should significantly improve the load times of large datasets. I want to find out if this is true, and whether it is worth investing in SSDs to improve the IO rates in data-intensive applications.
Comments (5)
My 2 cents: an SSD only pays off if your applications are stored on it, not your data. And even then, only if a lot of disk access is necessary, as for an OS. People are right to point you to profiling. I can tell you without doing it that almost all of the reading time goes to processing, not to reading from the disk.
It pays off far more to think about the format of your data instead of where it's stored. A speedup in reading your data can be obtained by using the right applications and the right format. Like using R's internal format instead of fumbling around with text files. Make that an exclamation mark: never keep on fumbling around with text files. Go binary if speed is what you need.
Due to the overhead, it generally doesn't make a difference whether you have an SSD or a normal disk to read your data from. I have both, and use the normal disk for all my data. I do juggle big datasets sometimes, and have never experienced a problem with it. Of course, if I have to go really heavy, I just work on our servers.
So it might make a difference when we're talking gigs and gigs of data, but even then I doubt very much that disk access is the limiting factor. Unless you're continuously reading and writing to the disk, but then I'd say you should start thinking again about what exactly you're doing. Instead of spending that money on SSD drives, extra memory could be the better option. Or just convince the boss to get you a decent calculation server.
A timing experiment using a bogus data frame, reading and writing in text format vs. binary format on an SSD disk vs. a normal disk, makes the point.
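A minimal sketch of such an experiment in R (the data frame and file names are made up; run the same script with the working directory on each disk to compare):

    # Build a bogus data frame: two numeric columns plus one long string
    # that is identical on every row.
    n <- 1e6
    df <- data.frame(x = runif(n),
                     y = rnorm(n),
                     longtext = "the same long string repeated on every row")

    # Text format: formatting on write and parsing on read dominate.
    system.time(write.table(df, "df.txt", sep = ";", row.names = FALSE))
    system.time(read.table("df.txt", sep = ";", header = TRUE))

    # R's native binary format: close to a raw dump of the in-memory object.
    system.time(save(df, file = "df.RData"))
    system.time(load("df.RData"))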
http://www.codinghorror.com/blog/2010/09/revisiting-solid-state-hard-drives.html has a good article on SSDs, and the comments offer a lot of insights.
Depends on the type of analysis you're doing, whether it's CPU bound or IO bound.
Personal experience with regression modelling tells me the former is more often the case, and SSDs wouldn't be of much use then.
In short, best to profile your application first.
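For instance, a minimal sketch with R's built-in profiler (the data file name is a placeholder):

    # Profile the load step to see whether the time goes to disk I/O
    # or to parsing and object construction.
    Rprof("profile.out")
    df <- read.table("big_dataset.txt", sep = ";", header = TRUE)
    Rprof(NULL)
    summaryRprof("profile.out")$by.self  # the top rows are the real bottleneck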
Sorry, but I have to disagree with the top-rated answer by @joris. It's true that if you run that code, the binary version takes almost zero time to write. But that's because the test set is weird. The big column 'longtext' is the same for every row. Data frames in R are smart enough not to store duplicate values more than once (via factors).
So in the end we finish with a 700 MB text file versus a 335 KB binary file (of course the binary is much faster xD).
However, if we try with random data, where every row gets a different string (see the sketch below), the files are not that different in size.
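A hypothetical version of that test, with a different random string in every row so nothing can be deduplicated:

    # Same experiment, but 'longtext' is now unique per row, so neither
    # factor levels nor serialization can collapse the duplicates.
    n <- 1e5
    rand_string <- function(len) {
      paste(sample(c(letters, LETTERS, 0:9), len, replace = TRUE), collapse = "")
    }
    df <- data.frame(x = runif(n),
                     longtext = vapply(seq_len(n), function(i) rand_string(1000),
                                       character(1)),
                     stringsAsFactors = FALSE)

    system.time(write.table(df, "df.txt", sep = ";", row.names = FALSE))
    system.time(save(df, file = "df.RData"))
    # On the server described below, elapsed time exceeded user + system
    # in both cases: the gap is time spent waiting on the disk.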
As you can see, the elapsed time is not just the sum of user + system time: the extra wall-clock time is spent waiting on I/O, so the disk is the bottleneck in both cases. Yes, binary storage will always be faster, since you don't have to write out semicolons, quotes or stuff like that, but can just dump the memory object to disk.
BUT there is always a point where the disk becomes the bottleneck. My test was run on a research server where, via a NAS solution, we get disk read/write speeds of over 600 MB/s. If you do the same on your laptop, where it's hard to get over 50 MB/s, you'll notice the difference.
So, if you actually have to deal with real big data (and repeating the same thousand-character string a million times is not big data), then once the binary dump of the data is over 1 GB, you'll appreciate having a good disk (an SSD is a good choice) for reading input data and writing results back to disk.
I have to second John's suggestion to profile your application. My experience is that it isn't the actual data reads that are the slow part; it's the overhead of creating the programming objects to contain the data, casting from strings, memory allocation, etc.
I would strongly suggest you profile your code first, and consider using alternative libraries (like numpy) to see what improvements you can get before you invest in hardware.
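In R, for example, part of that string-casting overhead can be avoided by declaring the column types up front instead of letting the reader guess them (a sketch; the file and column types are assumptions):

    # Pre-declared column classes skip the type-guessing pass and the
    # per-column re-casting that read.table otherwise performs.
    df <- read.table("df.txt", sep = ";", header = TRUE,
                     colClasses = c("numeric", "numeric", "character"))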
The read and write speeds of SSDs are significantly higher than those of standard 7200 RPM disks (an SSD is still worth it over a 10k RPM disk; I'm not sure how much of an improvement it is over a 15k). So, yes, you'd get much faster times on data access.
The performance improvement is undeniable. Then it's a question of economics: 2 TB 7200 RPM disks are $170 a piece, while 100 GB SSDs cost $210. So if you have a lot of data, you may run into a problem.
If you read/write a lot of data, get an SSD. If the application is CPU intensive, however, you'd benefit much more from getting a better processor.