So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.
So my question is this: given that I want all the counties but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?
Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?
Any ideas would be greatly appreciated.
Update:
- Constraints
- Needs to use my machine, so no EC2 instances
- As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
- As you can see below, the data contains mixed types, which I need to operate on later
- Data
- The data is 3.5GB, with about 8.5 million rows and 17 columns
- A couple thousand rows (~2k) are malformed, with only one column instead of 17
- These are entirely unimportant and can be dropped
- I only need ~100,000 rows out of this file (See below)
Data example:
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC [Malformed row]
[8.5 Mill rows]
I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:
County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]
Results:
After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.
I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want.
It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.
## Open a connection separately to hold the cursor position
file.in  <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')
line <- readLines(file.in, n = 1)
line.split <- strsplit(line, ';')
# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
## Use a loop to read in the rest of the lines
line <- readLines(file.in, n = 1)
while (length(line) > 0) {
  line.split <- strsplit(line, ';')
  if (length(line.split[[1]]) > 1) {
    if (line.split[[1]][3] == '2009') {
      cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    }
  }
  line <- readLines(file.in, n = 1)
}
close(file.in)
close(file.out)
Failings by Approach:
- sqldf
- This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.
- MapReduce
- To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
- bigmemory
- This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.matrix. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.
- scan
- Scan seemed to have similar type issues as bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
Answers (13):
My try with readLines. This piece of code creates a csv with the selected years.
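The code from that answer isn't reproduced above, so here is a minimal sketch of the same readLines idea, reading the file in blocks of lines rather than one line at a time; the file names, the block size, and the column positions are assumptions based on the data example in the question, and header handling is omitted.
## Sketch: read the file in blocks with readLines and keep only the wanted years
## 'bad_data.txt', 'selected_years.csv' and the block size are placeholders
con.in  <- file('bad_data.txt', 'rt')
con.out <- file('selected_years.csv', 'wt')
repeat {
  block <- readLines(con.in, n = 100000)               # up to 100k lines per pass
  if (length(block) == 0) break                        # end of file
  fields <- strsplit(block, ';', fixed = TRUE)
  keep <- vapply(fields,
                 function(x) length(x) >= 8 && x[3] %in% c('2009', '2010'),
                 logical(1))                           # drop malformed rows and other years
  for (x in fields[keep]) {
    cat(x[1:5], x[8], sep = ';', file = con.out, fill = TRUE)
  }
}
close(con.in)
close(con.out)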
I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this.
Alternatively, R provides several packages to deal with large data that goes outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.
If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.
If you find yourself wanting to read many gigs of delimited data into R you should at least research the sqldf package which allows you to import directly into sqldf from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
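For reference, a minimal sketch of what that sqldf route could look like for this file; the file name and column names are taken from the data example in the question. This is the route that choked on the malformed rows in the "Failings" section above, but it works nicely on well-formed files.
## Sketch: let SQLite do the filtering during import via sqldf
## inside the sql string the imported file is referred to as 'file'
library(sqldf)
small <- read.csv.sql('bad_data.txt',
                      sql = "select County, State, Year, Quarter, Segment, GDP
                             from file where Year in ('2009', '2010')",
                      header = TRUE, sep = ';')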
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
It passes any arguments along to read.table, so the combination should let you subset pretty tightly.
The ff package is a transparent way to deal with huge files. You may see the package website and/or a presentation about it.
I hope this helps
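To make that concrete, here is a rough sketch under the assumption that the malformed one-column rows have already been stripped out of 'bad_data.txt' (read.table.ffdf is built on read.table, so rows with the wrong number of fields will trip it up); file and column names are taken from the question.
## Sketch: keep the full table on disk with ff, pull only the wanted rows into RAM
library(ff)
big   <- read.table.ffdf(file = 'bad_data.txt', header = TRUE, sep = ';')
years <- big$Year[]                              # one column read into RAM (~8.5M values)
small <- big[which(years %in% c(2009, 2010)), ]  # selected rows come back as a data.frame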
What about using readr and the read_*_chunked family? These functions apply a callback f to each chunk, remember the column names, and combine the filtered results at the end; see ?callback, which is the source of this example. You can even increase chunk_size, although the answer's own small example file had only 4 lines.
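The answer's own code and sample file were not captured above, so here is a minimal sketch of the same idea applied to the question's file; the file name, the column names, and chunk_size are assumptions.
## Sketch: readr reads the file in chunks, the callback filters each chunk,
## and DataFrameCallback row-binds the filtered pieces at the end
library(readr)
f <- function(chunk, pos) {
  chunk[chunk$Year %in% c(2009, 2010),
        c('County', 'State', 'Year', 'Quarter', 'Segment', 'GDP')]
}
small <- read_delim_chunked('bad_data.txt', delim = ';',
                            callback = DataFrameCallback$new(f),
                            chunk_size = 100000)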
You could import data to SQLite database and then use RSQLite to select subsets.
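A small sketch of that, assuming the data has already been loaded into a SQLite database (for example chunk by chunk, or with the sqlite3 command line's .import); the database file and table name are placeholders.
## Sketch: query an on-disk SQLite database and pull back only the subset
library(DBI)
library(RSQLite)
con   <- dbConnect(RSQLite::SQLite(), 'gdp.sqlite')
small <- dbGetQuery(con, "select County, State, Year, Quarter, Segment, GDP
                          from gdp where Year in ('2009', '2010')")
dbDisconnect(con)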
Have you considered bigmemory?
Check out this and this.
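As noted in the "Failings" section above, a big.matrix holds a single type, so the character columns would have to be dropped or recoded as numbers first; with that caveat, a sketch could look like this (the input file and backing-file names are placeholders).
## Sketch: a file-backed, numeric-only big.matrix, filtered by year
library(bigmemory)
big   <- read.big.matrix('numeric_only.txt', sep = ';', header = TRUE,
                         type = 'double',
                         backingfile = 'gdp.bin', descriptorfile = 'gdp.desc')
years <- big[, 'Year']                        # one column pulled into RAM
small <- big[years %in% c(2009, 2010), ]      # ordinary in-memory matrix of the wanted rows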
Perhaps you can migrate to MySQL or PostgreSQL to free yourself from the MS Access limitations.
It is quite easy to connect R to these systems with a DBI-based database connector (available on CRAN).
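The query side then mirrors the SQLite sketch above; for example, against a hypothetical local PostgreSQL database (connection details and table name are made up for illustration).
## Sketch: the same DBI pattern, but against a server-side database
library(DBI)
con   <- dbConnect(RPostgres::Postgres(), dbname = 'gdp', host = 'localhost',
                   user = 'me', password = 'secret')
small <- dbGetQuery(con, "select County, State, Year, Quarter, Segment, GDP
                          from county_gdp where Year in (2009, 2010)")
dbDisconnect(con)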
scan() has both an nlines argument and a skip argument. Is there some reason you can't just use those to read in a chunk of lines at a time, checking the date to see if it's appropriate? If the input file is ordered by date, you can store an index that tells you what your skip and nlines should be, which would speed up the process in the future.
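A rough sketch of that chunked scan() loop; for brevity it pretends the file has exactly the eight columns shown in the data example (the real what list would need all 17 columns, in order), and the chunk size is arbitrary.
## Sketch: read fixed-size chunks with scan(), keep only the wanted years
cols <- list(County = '', State = '', Year = 0L, Quarter = 0L,
             Segment = '', SubSegment = '', SubSubSegment = '', GDP = 0)
pieces <- list()
skip   <- 1                                    # skip the header line
repeat {
  chunk <- scan('bad_data.txt', what = cols, sep = ';', skip = skip,
                nlines = 100000, fill = TRUE, quiet = TRUE)
  if (length(chunk$County) == 0) break         # nothing left to read
  df <- as.data.frame(chunk, stringsAsFactors = FALSE)
  pieces[[length(pieces) + 1]] <- df[df$Year %in% c(2009, 2010),
                                     c('County', 'State', 'Year', 'Quarter', 'Segment', 'GDP')]
  skip <- skip + 100000
}
small <- do.call(rbind, pieces)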
These days, 3.5GB just isn't really that big; I can get access to a machine with 244GB of RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data type solutions? How much is your time worth? Yes, it will take you an hour or two to figure out how to use AWS, but you can learn the basics on a free tier, upload the data, and read the first 10k lines into R to check it worked, and then you can fire up a big memory instance like r3.8xlarge and read it all in! Just my 2c.
Now, in 2017, I would suggest going for Spark and SparkR.
- The syntax can be written in a simple, rather dplyr-similar way.
- It fits quite well to small memory (small in the 2017 sense of the word).
However, it may be an intimidating experience to get started...
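A minimal SparkR sketch of that idea (local mode; the file name and column names are assumed from the question, and the CSV options are passed through to Spark's csv reader).
## Sketch: Spark keeps the big table outside R's memory;
## only the filtered result is collected into an ordinary data.frame
library(SparkR)
sparkR.session(master = 'local[*]')
df <- read.df('bad_data.txt', source = 'csv',
              header = 'true', sep = ';', inferSchema = 'true')
small <- collect(select(filter(df, df$Year %in% c(2009, 2010)),
                        'County', 'State', 'Year', 'Quarter', 'Segment', 'GDP'))
sparkR.session.stop()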
I would go for a DB and then make some queries to extract the samples you need via DBI.
Please avoid importing a 3.5 GB csv file into SQLite. Or at least double check that your HUGE db fits within SQLite limits, http://www.sqlite.org/limits.html
It's a damn big DB you have. I would go for MySQL if you need speed. But be prepared to wait a lot of hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...
Amazon's EC2 could also be a good solution for instantiating a server running R and MySQL.
My two humble pennies' worth.