如何将一大列日期对象分配/附加到数据框
我有一个名为 tr.sql 的数据框(3 列,12146637 行),占用 184Mb。 (它由 SQL 支持,它是我通过 read.csv.sql 读取的数据集的内容)
第 2 列是 tr.sql$visit_date。 SQL 不允许将日期本地表示为 R Date 对象,这对于我需要如何处理数据很重要。
因此我想将 tr.sql 的内容复制到新的数据框 tr (其中访问日期列可以本机表示为日期(chron::Date?)。相信我,这使探索性数据分析变得更容易,目前这就是我想要的方式 - 我最终可能会使用本机 SQL,但请不要'现在不要对此争论。)
这是我的解决方案(感谢 gsk 和大家)+解决方法:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
然后解决方法复制 tr.sql ->使用 for 循环将 tr 分成(例如)N/8 块,以便 str->Date 转换中涉及的临时数据不会内存不足,并且在每个之后进行垃圾收集:
for (i in 0 :7) { 从 <- 楼层(i*N/8) 到 <- 楼层((i+1)*N/8) -1 如果(i==7) 至<-N print(c("复制 tr.sql$visit_date",from,to," ...")) tr$visit_date[从:到] <- as.Date(tr.sql$visit_date[从:到]) GC() } rm(tr.sql) memsize_gc() ... # 最终只有 321 Mb! (复制期间约为 1Gb)
问题是分配然后复制访问日期列。 这是数据集和代码,我遇到了多个单独的问题,解释如下:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
和代码:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
- 尝试声明
data.frame(visit_date = Date(N))
的第二行失败,我不知道 Date 对象的命名空间的正确限定名称(尝试了 chron::Date 、 Dates::Date ?不起作用) - 分配和追加的尝试都失败。甚至不确定在数据帧的单个大列上使用追加是否合法或有效。
请记住这些对象很大,因此请避免使用临时对象。 提前致谢...
I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184Mb.
(it's backed by SQL, it is the contents of my dataset which I read in via read.csv.sql)
Column 2 is tr.sql$visit_date. SQL does not allow natively representing dates as an R Date object, this is important for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data-frame tr
(where the visit_date column can be natively represented as Date (chron::Date?). Trust me, this makes exploratory data analysis easier, for now this is how I want to do it - I might use native SQL eventually but please don't quibble that for now.)
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
then workaround copying tr.sql -> tr in chunks of (say) N/8 using a for-loop, so that the temporary involved in the str->Date conversion does not out-of-memory, and a garbage-collect after each:
for (i in 0:7) {
from <- floor(i*N/8)
to <- floor((i+1)*N/8) -1
if (i==7)
to <- N
print(c("Copying tr.sql$visit_date",from,to," ..."))
tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
gc()
}
rm(tr.sql)
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
The problem is allocating then copying the visit_date column.
Here is the dataset and code, I am having multiple separate problems with this, explanation below:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
and code:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `lt;-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
- The second line that tries to declare
data.frame(visit_date = Date(N))
fails, I don't know the correct qualified name with namespace for Date object (triedchron::Date , Dates::Date
? don't work) - Both the attempt to assign and append fail. Not even sure whether it is legal, or efficient, to use append on a single large column of a data-frame.
Remember these objects are big, so avoid using temporaries.
Thanks in advance...
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
请尝试执行此操作,确保您使用的是最新版本的 sqldf(当前版本为 0.4-1.2)。
(如果您发现内存不足,请尝试通过在
read.csv.sql
调用中添加dbname = tempfile()
参数将数据库放在磁盘上。如果甚至如果失败了,那么它相对于可用内存来说太大了,无论如何你都不可能用它做很多分析。)Try this ensuring that you are using the most recent version of sqldf (currently version 0.4-1.2).
(If you find you are running out of memory try putting the database on disk by adding the
dbname = tempfile()
argument to theread.csv.sql
call. If even that fails then its so large in relation to available memory that its unlikely you are going to be able to do much analysis with it anyways.)在我看来,你那里没有一个 data.frame (N 是长度为 1 的向量)。应该很简单:
或者甚至更有效:
作为旁注,您的标题说“追加”,但我认为这不是您想要的操作。您正在使 data.frame 更宽,而不是将它们附加到末尾(使其更长)。从概念上讲,这是一个
cbind()
操作。It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:
Or even more efficient:
As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a
cbind()
operation.试试这个:
如果您的格式是 YYYY-MM-DD 或 YYYY/MM/DD,这将会成功。如果不是这些格式之一,则发布更多详细信息。如果 tr.sql$visit_date 是等于原点之后天数的数值向量,它也会成功。例如:
Try this:
This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If not one of those formats then post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin. E.g: