Strategies for reading a CSV file in pieces?

Posted 2025-01-06 22:18:56

I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.

Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?

Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks:
Quickly reading very large tables as dataframes in R

I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.

So reading it in pieces still seems like the best way to go for now.

Comments (2)

夜光 2025-01-13 22:18:56

After reviewing this thread I noticed that an obvious solution to this problem had not been mentioned. Use connections!

1) Open a connection to your file

con = file("file.csv", "r")

2) Read the file in chunks with read.csv

read.csv(con, nrows = chunk_size, ...)  # chunk_size is a number of rows, e.g. 1e5

Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as "NULL".

3) Do whatever you need to do

4) Repeat.

5) Close the connection

close(con)
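
Putting the steps together, here is a minimal sketch of the loop, with placeholder names for the file and chunk size and assuming a plain comma-separated header. Note that only the first read from the connection would see the header line, so the sketch consumes it once up front and reads every chunk with header = FALSE:

con <- file("file.csv", "r")

# Read the header line once so each chunk can be read with header = FALSE.
col_names <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]

chunk_size <- 1e5
repeat {
  chunk <- tryCatch(
    # Supplying colClasses here speeds things up; use "NULL" to drop columns.
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL  # read.csv errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # ... subset `chunk` to the rows/columns you need, accumulate or write out ...
}

close(con)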

The advantage of this approach is the connection. If you skip it, things will likely slow down a bit. By opening a connection manually, you essentially open the data set and do not close it until you call close. This means that as you loop through the data set you never lose your place. Imagine you have a data set with 1e7 rows and you want to load a chunk of 1e5 rows at a time. Since we open the connection, we get the first 1e5 rows by running read.csv(con, nrows=1e5,...), then to get the second chunk we run read.csv(con, nrows=1e5,...) again, and so on.

If we did not use a connection we would get the first chunk the same way, read.csv("file.csv", nrows=1e5,...), but for the next chunk we would need read.csv("file.csv", skip=1e5, nrows=1e5,...). Clearly this is inefficient: even though we just read the first 1e5 rows, the reader has to scan past them all over again to find row 1e5+1.

Finally, data.table::fread is great, but you cannot pass it connections, so this approach does not work with it.

I hope this helps someone.

UPDATE

People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
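
On that note, readr also provides read_csv_chunked, which wraps this read-process-repeat pattern in a single call. A sketch, assuming a hypothetical column named value that you want to filter on:

library(readr)

result <- read_csv_chunked(
  "file.csv",
  # DataFrameCallback row-binds whatever the callback returns for each chunk.
  callback = DataFrameCallback$new(function(chunk, pos) {
    subset(chunk, value > 0)  # keep only the rows you need from this chunk
  }),
  chunk_size = 1e5
)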

高跟鞋的旋律 2025-01-13 22:18:56

You could read it into a database using RSQLite, say, and then use an SQL statement to get a portion.

If you need only a single portion, then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so R's limitations (primarily RAM in this scenario) don't apply. Second, after loading the data into the database, sqldf reads the output of the specified SQL statement into R and finally destroys the database. Depending on how fast it works with your data, you may be able to just repeat the whole process for each portion if you have several.

Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.

DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
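
For instance, a hypothetical call assuming the CSV has a column named value; within the sql argument the data is referred to by the table name file:

library(sqldf)
DF <- read.csv.sql("myfile.csv", sql = "select * from file where value > 0")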

See ?read.csv.sql and ?sqldf and also the sqldf home page.
