So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.
So my question is this: given that I want all the counties but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?
Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?
Any ideas would be greatly appreciated.
Update:
- Constraints
- Needs to use my machine, so no EC2 instances
- As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
- As you can see below, the data contains mixed types, which I need to operate on later
- Data
- The data is 3.5GB, with about 8.5 million rows and 17 columns
- A couple thousand rows (~2k) are malformed, with only one column instead of 17
- These are entirely unimportant and can be dropped
- I only need ~100,000 rows out of this file (See below)
Data example:
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC [Malformed row]
[8.5 Mill rows]
I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:
County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]
Results:
After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.
I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want.
It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.
## Open a connection separately to hold the cursor position
file.in  <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')
line <- readLines(file.in, n = 1)
line.split <- strsplit(line, ';')
# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
## Use a loop to read in the rest of the lines
line <- readLines(file.in, n = 1)
while (length(line) > 0) {
  line.split <- strsplit(line, ';')
  if (length(line.split[[1]]) > 1) {
    if (line.split[[1]][3] == '2009') {
      cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    }
  }
  line <- readLines(file.in, n = 1)
}
close(file.in)
close(file.out)
Failings by Approach:
- sqldf
- This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.
- MapReduce
- To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
- bigmemory
- This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.matrix. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.
- scan
- Scan seemed to have similar type issues as bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
Answers (13):
My try with readLines. This piece of code creates a csv with the selected years.
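The code from that answer isn't reproduced above, so here is a minimal sketch of the same readLines idea, reading the file in blocks of lines rather than one line at a time; the file names, the block size, and the column positions are assumptions based on the data example in the question, and header handling is omitted.
## Sketch: read the file in blocks with readLines and keep only the wanted years
## 'bad_data.txt', 'selected_years.csv' and the block size are placeholders
con.in  <- file('bad_data.txt', 'rt')
con.out <- file('selected_years.csv', 'wt')
repeat {
  block <- readLines(con.in, n = 100000)               # up to 100k lines per pass
  if (length(block) == 0) break                        # end of file
  fields <- strsplit(block, ';', fixed = TRUE)
  keep <- vapply(fields,
                 function(x) length(x) >= 8 && x[3] %in% c('2009', '2010'),
                 logical(1))                           # drop malformed rows and other years
  for (x in fields[keep]) {
    cat(x[1:5], x[8], sep = ';', file = con.out, fill = TRUE)
  }
}
close(con.in)
close(con.out)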
I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this.
Alternatively, R provides several packages to deal with large data that goes outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.
If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.
If you find yourself wanting to read many gigs of delimited data into R you should at least research the sqldf package which allows you to import directly into sqldf from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
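For reference, a minimal sketch of what that sqldf route could look like for this file; the file name and column names are taken from the data example in the question. This is the route that choked on the malformed rows in the "Failings" section above, but it works nicely on well-formed files.
## Sketch: let SQLite do the filtering during import via sqldf
## inside the sql string the imported file is referred to as 'file'
library(sqldf)
small <- read.csv.sql('bad_data.txt',
                      sql = "select County, State, Year, Quarter, Segment, GDP
                             from file where Year in ('2009', '2010')",
                      header = TRUE, sep = ';')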
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
It passes any arguments along to read.table, so the combination should let you subset pretty tightly.
The ff package is a transparent way to deal with huge files. You may see the package website and/or a presentation about it.
I hope this helps
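To make that concrete, here is a rough sketch under the assumption that the malformed one-column rows have already been stripped out of 'bad_data.txt' (read.table.ffdf is built on read.table, so rows with the wrong number of fields will trip it up); file and column names are taken from the question.
## Sketch: keep the full table on disk with ff, pull only the wanted rows into RAM
library(ff)
big   <- read.table.ffdf(file = 'bad_data.txt', header = TRUE, sep = ';')
years <- big$Year[]                              # one column read into RAM (~8.5M values)
small <- big[which(years %in% c(2009, 2010)), ]  # selected rows come back as a data.frame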
What about using readr and the read_*_chunked family? These functions apply a callback f to each chunk, remember the column names, and combine the filtered results at the end; see ?callback, which is the source of this example. You can even increase chunk_size, although the answer's own small example file had only 4 lines.
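The answer's own code and sample file were not captured above, so here is a minimal sketch of the same idea applied to the question's file; the file name, the column names, and chunk_size are assumptions.
## Sketch: readr reads the file in chunks, the callback filters each chunk,
## and DataFrameCallback row-binds the filtered pieces at the end
library(readr)
f <- function(chunk, pos) {
  chunk[chunk$Year %in% c(2009, 2010),
        c('County', 'State', 'Year', 'Quarter', 'Segment', 'GDP')]
}
small <- read_delim_chunked('bad_data.txt', delim = ';',
                            callback = DataFrameCallback$new(f),
                            chunk_size = 100000)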
You could import data to SQLite database and then use RSQLite to select subsets.
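A small sketch of that, assuming the data has already been loaded into a SQLite database (for example chunk by chunk, or with the sqlite3 command line's .import); the database file and table name are placeholders.
## Sketch: query an on-disk SQLite database and pull back only the subset
library(DBI)
library(RSQLite)
con   <- dbConnect(RSQLite::SQLite(), 'gdp.sqlite')
small <- dbGetQuery(con, "select County, State, Year, Quarter, Segment, GDP
                          from gdp where Year in ('2009', '2010')")
dbDisconnect(con)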
Have you considered bigmemory?
Check out this and this.
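As noted in the "Failings" section above, a big.matrix holds a single type, so the character columns would have to be dropped or recoded as numbers first; with that caveat, a sketch could look like this (the input file and backing-file names are placeholders).
## Sketch: a file-backed, numeric-only big.matrix, filtered by year
library(bigmemory)
big   <- read.big.matrix('numeric_only.txt', sep = ';', header = TRUE,
                         type = 'double',
                         backingfile = 'gdp.bin', descriptorfile = 'gdp.desc')
years <- big[, 'Year']                        # one column pulled into RAM
small <- big[years %in% c(2009, 2010), ]      # ordinary in-memory matrix of the wanted rows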
Perhaps you can migrate to MySQL or PostgreSQL to free yourself from the MS Access limitations.
It is quite easy to connect R to these systems with a DBI-based database connector (available on CRAN).
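The query side then mirrors the SQLite sketch above; for example, against a hypothetical local PostgreSQL database (connection details and table name are made up for illustration).
## Sketch: the same DBI pattern, but against a server-side database
library(DBI)
con   <- dbConnect(RPostgres::Postgres(), dbname = 'gdp', host = 'localhost',
                   user = 'me', password = 'secret')
small <- dbGetQuery(con, "select County, State, Year, Quarter, Segment, GDP
                          from county_gdp where Year in (2009, 2010)")
dbDisconnect(con)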
scan() has both an nlines argument and a skip argument. Is there some reason you can't just use those to read in a chunk of lines at a time, checking the date to see if it's appropriate? If the input file is ordered by date, you can store an index that tells you what your skip and nlines should be, which would speed up the process in the future.
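A rough sketch of that chunked scan() loop; for brevity it pretends the file has exactly the eight columns shown in the data example (the real what list would need all 17 columns, in order), and the chunk size is arbitrary.
## Sketch: read fixed-size chunks with scan(), keep only the wanted years
cols <- list(County = '', State = '', Year = 0L, Quarter = 0L,
             Segment = '', SubSegment = '', SubSubSegment = '', GDP = 0)
pieces <- list()
skip   <- 1                                    # skip the header line
repeat {
  chunk <- scan('bad_data.txt', what = cols, sep = ';', skip = skip,
                nlines = 100000, fill = TRUE, quiet = TRUE)
  if (length(chunk$County) == 0) break         # nothing left to read
  df <- as.data.frame(chunk, stringsAsFactors = FALSE)
  pieces[[length(pieces) + 1]] <- df[df$Year %in% c(2009, 2010),
                                     c('County', 'State', 'Year', 'Quarter', 'Segment', 'GDP')]
  skip <- skip + 100000
}
small <- do.call(rbind, pieces)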
These days, 3.5GB just isn't really that big; I can get access to a machine with 244GB of RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data type solutions? How much is your time worth? Yes, it will take you an hour or two to figure out how to use AWS, but you can learn the basics on a free tier, upload the data, and read the first 10k lines into R to check it worked, and then you can fire up a big memory instance like r3.8xlarge and read it all in! Just my 2c.
Now, in 2017, I would suggest going for Spark and SparkR.
- The syntax can be written in a simple, rather dplyr-similar way.
- It fits quite well to small memory (small in the 2017 sense of the word).
However, it may be an intimidating experience to get started...
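A minimal SparkR sketch of that idea (local mode; the file name and column names are assumed from the question, and the CSV options are passed through to Spark's csv reader).
## Sketch: Spark keeps the big table outside R's memory;
## only the filtered result is collected into an ordinary data.frame
library(SparkR)
sparkR.session(master = 'local[*]')
df <- read.df('bad_data.txt', source = 'csv',
              header = 'true', sep = ';', inferSchema = 'true')
small <- collect(select(filter(df, df$Year %in% c(2009, 2010)),
                        'County', 'State', 'Year', 'Quarter', 'Segment', 'GDP'))
sparkR.session.stop()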
I would go for a DB and then make some queries to extract the samples you need via DBI.
Please avoid importing a 3.5 GB csv file into SQLite. Or at least double check that your HUGE db fits within SQLite limits, http://www.sqlite.org/limits.html
It's a damn big DB you have. I would go for MySQL if you need speed. But be prepared to wait a lot of hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...
Amazon's EC2 could also be a good solution for instantiating a server running R and MySQL.
My two humble pennies' worth.