Strategies for reading in CSV files in pieces?
I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.
Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?
Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks:
Quickly reading very large tables as dataframes in R
I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.
So reading it in pieces still seems like the best way to go for now.
After reviewing this thread I noticed a conspicuous solution to this problem was not mentioned. Use connections!
1) Open a connection to your file
2) Read in chunks of data with read.csv (a short sketch follows below)
Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as NULL.
3) Do whatever you need to do
4) Repeat.
5) Close the connection
The advantage of this approach is the connection; if you skip that step, it will likely slow things down. By opening a connection manually, you essentially open the data set and do not close it until you call the close function. This means that as you loop through the data set you never lose your place. Say you have a data set with 1e7 rows and you want to load a chunk of 1e5 rows at a time. Since we opened the connection, we get the first 1e5 rows by running read.csv(con, nrow=1e5, ...), then to get the second chunk we run read.csv(con, nrow=1e5, ...) again, and so on.

If we did not use the connection, we would get the first chunk the same way, read.csv("file.csv", nrow=1e5, ...), but for the next chunk we would need read.csv("file.csv", skip=1e5, nrow=1e5, ...). Clearly this is inefficient: we have to find row 1e5+1 all over again, despite the fact that we just read the first 1e5 rows.

Finally, data.table::fread is great, but you cannot pass it connections, so this approach does not work with it. I hope this helps someone.
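A minimal sketch of the five steps above, assuming a file called file.csv read in chunks of 1e5 rows; the simple comma-separated header and the processing step are placeholders:

con <- file("file.csv", open = "r")                # 1) open the connection

header <- readLines(con, n = 1)                    # consume the header line once
col_names <- strsplit(header, ",")[[1]]            # assumes no quoted commas in the header

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 1e5, header = FALSE,     # 2) read one chunk; adding colClasses
             col.names = col_names),               #    (with "NULL" for unwanted columns)
    error = function(e) NULL                       #    speeds up parsing
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # 3) do whatever you need to do with `chunk` here (subset, aggregate, ...)
}                                                  # 4) repeat until the file is exhausted

close(con)                                         # 5) close the connection

Because the connection stays open, each read.csv call simply continues from where the previous one stopped.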
UPDATE
People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
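As a side note, and not what the answer above describes: readr also ships its own chunked interface, read_csv_chunked, which handles the looping for you through a callback. A rough sketch, where the column name value and the filter threshold are made up:

library(readr)

# Keep only the rows we care about from each 1e5-row chunk; readr combines
# the callback results into a single data frame at the end.
keep_rows <- DataFrameCallback$new(function(chunk, pos) {
  subset(chunk, value > 100)   # hypothetical column and threshold
})

result <- read_csv_chunked("file.csv", callback = keep_rows, chunk_size = 1e5)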
You could read it into a database using RSQLite, say, and then use an SQL statement to get a portion.
If you need only a single portion, then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so limitations of R won't apply (which is primarily RAM in this scenario). Second, after loading the data into the database, sqldf reads the output of a specified SQL statement into R and finally destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several. Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.
See ?read.csv.sql and ?sqldf and also the sqldf home page.
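A hedged sketch of that one-liner; the column names and the filter in the sql argument are made up, and inside the query the data is referred to as the table file:

library(sqldf)

# Creates a temporary SQLite database, loads file.csv into it outside of R,
# runs the query, returns the result as a data frame, and destroys the database.
portion <- read.csv.sql("file.csv",
                        sql = "select id, year, value from file where year >= 2010")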