R: Grouping/splitting a data set by category in combination with apply.weekly()
Intro
I am not an R expert yet, so please excuse another question that I should probably be embarrassed to ask. In another question I asked on stackoverflow I got some very helpful comments on how to aggregate irregular daily data in an xts object to weekly values with the apply.weekly() function. Unfortunately, I did not find a function like tapply(), ddply(), by() or aggregate() that splits the data by category and works together with apply.weekly().
My Data
This is my example dataset. I already posted it in the other question. For illustration purposes I am taking the liberty of posting it here as well:
example <- as.data.frame(structure(c(" 1", " 2", " 1", " 2", " 1", " 1", " 2", " 1", " 2",
" 1", " 2", " 3", " 1", " 1", " 2", " 2", " 3", " 1", " 2", " 2",
" 1", " 2", " 1", " 1", " 2", NA, " 2", NA, NA, " 1", " 3", " 1",
" 3", " 3", " 2", " 3", " 3", " 3", " 2", " 2", " 2", " 3", " 3",
" 3", " 2", " 2", " 3", " 3", " 3", " 3", " 1", " 2", " 1", " 2",
" 2", " 1", " 2", " 1", " 2", " 2", " 2", " 3", " 1", " 1", " 2",
" 2", " 3", " 3", " 2", " 2", " 1", " 2", " 1", " 1", " 2", NA,
" 2", NA, NA, " 1", " 3", " 2", " 3", " 2", " 0", " 3", " 3",
" 3", " 2", " 0", " 2", " 3", " 3", " 3", " 0", " 2", " 2", " 3",
" 3", " 0", "12", " 5", " 9", "14", " 5", "tra", "tra", "man",
"inf", "agc", "07-2011", "07-2011", "07-2011", "07-2011", "07-2011"
), .indexCLASS = c("POSIXlt", "POSIXt"), .indexTZ = "", class = c("xts",
"zoo"), .indexFORMAT = "%U-%Y", index = structure(c(1297642226,
1297672737, 1297741204, 1297748893, 1297749513), tzone = "", tclass = c("POSIXlt",
"POSIXt")), .Dim = c(5L, 23L), .Dimnames = list(NULL, c("rev_sit",
"prof_sit", "emp_nr_sit", "inv_sit", "ord_home_sit", "ord_abr_sit",
"emp_cost_sit", "usage_cost_sit", "tax_cost_sit", "gov_cost_sit",
"rev_exp", "prof_exp", "emp_nr_exp", "inv_exp", "ord_home_exp",
"ord_abr_exp", "emp_cost_exp", "usage_cost_exp", "tax_cost_exp",
"gov_cost_exp", "land", "nace", "index"))))
The columns
"rev_sit", "prof_sit", "emp_nr_sit", "inv_sit", "ord_home_sit", "ord_abr_sit", "emp_cost_sit", "usage_cost_sit", "tax_cost_sit", "gov_cost_sit", "rev_exp", "prof_exp", "emp_nr_exp", "inv_exp", "ord_home_exp", "ord_abr_exp", "emp_cost_exp", "usage_cost_exp", "tax_cost_exp", "gov_cost_exp"
refer to questions in a survey. There were three possible answers, coded as "1", "2", and "3".
The columns
"land", "nace"
are categorical, with 16 and 8 unique levels respectively.
My goal
My goal is to count, for each week, the occurrences of "1", "2", and "3" for each combination of the categories in "nace" and "land". My idea was to create binary vectors for each possible answer {1,2,3} beforehand (example_1, example_2, example_3) and then apply something like:
apply.weekly(example_1, function(d){ddply(d,list(example$nace,example$land),sum)})
But this does not work with ddply, aggregate, by, etc.
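For reference, here is a minimal base R sketch of the counting itself, done without apply.weekly(): the week is kept as an ordinary column and passed to aggregate() as one more grouping variable. The toy data frame and the names toy, week, n1, n2 and n3 are made up for illustration; only the column names date, land, nace and rev_sit follow the question.

```r
# Toy stand-in for the survey data (values are made up for illustration).
toy <- data.frame(
  date    = as.Date(c("2011-02-14", "2011-02-15", "2011-02-15",
                      "2011-02-22", "2011-02-23")),
  land    = c("12", "12", "5", "12", "5"),
  nace    = c("tra", "tra", "man", "tra", "man"),
  rev_sit = c(1, 2, 1, 1, NA),
  stringsAsFactors = FALSE
)

# Label each row with the Monday that starts its week.
toy$week <- as.Date(as.character(cut(toy$date, breaks = "week")))

# Count answers 1, 2 and 3 per week x land x nace.
# %in% treats NA as "no match", so missing answers count as zero.
counts <- aggregate(
  x   = list(n1 = toy$rev_sit %in% 1,
             n2 = toy$rev_sit %in% 2,
             n3 = toy$rev_sit %in% 3),
  by  = list(week = toy$week, land = toy$land, nace = toy$nace),
  FUN = sum
)
print(counts)
```

The result has one row per week and category combination that actually occurs, so empty combinations take up no space.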
My workaround
My unprofessional workaround was initially not to create a time series at all, but just a date vector example$date, with the given time column coded as weekly via %V, and then to use, e.g.:
tapply(example_1[,5], list(example$date,example$nace,example$land),sum)
which I would of course then have to do for each of the twenty questions listed above. I then get, e.g. for example_1:
week1, nace1.land1, nace1.land2, nace1.land3, ..., nace1.land16, nace2.land1,..,nace8.land16
week2, nace1.land1, nace1.land2, nace1.land3, ..., nace1.land16, nace2.land1,..,nace8.land16
...
...
weekn, nace1.land1, nace1.land2, nace1.land3, ..., nace1.land16, nace2.land1,..,nace8.land16
I would have to do the same for 2 (example_2) and 3 (example_3), and this for each of the 20 questions, producing 16*8*3*20=7680 columns in all. This is extreme, and in addition the result of this method is not a time series, so it is not ordered correctly by week.
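The ordering problem can be reproduced in a few lines. This is only a sketch with made-up dates: week labels that put the week number first (the "%U-%Y" layout of the index column) sort as plain strings, so a week in 2012 lands before a week in 2011, whereas year-first labels or real dates sort chronologically.

```r
# Three made-up dates spanning a year boundary, deliberately unsorted.
dates <- as.Date(c("2011-12-26", "2012-01-02", "2011-02-14"))

week_first <- format(dates, "%U-%Y")  # same layout as the index column
year_first <- format(dates, "%Y-%U")  # year first, then week number

print(sort(week_first))  # string order, not chronological order
print(sort(year_first))  # string order matches chronological order
print(order(dates))      # Date values need no label to sort correctly
```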
Summary
So can anyone teach me, or give me a hint, how to use apply.weekly() in combination with functions such as tapply(), ddply(), by(), split(), unstack() etc., or any other method that achieves the grouping described above? Every hint is really appreciated. I am so frustrated that I am already thinking about abandoning my R experiment and going back to Stata, where so many things are much more intuitive with collapse() and by() etc. But don't get me wrong: I am keen to learn, so please help me!
Comments (2)
Thank you very much for all your help. I was busy with other things in the meantime, but now I have been working on my problem again, and with the help of your great comments I have found a solution:
I gave up working directly with time series, postponing this step to the end of my analysis. Therefore I take the date vector and transform it into weeks:
library(ISOweek)
d$index <- ISOweek(d$date)
(I do this with ISOweek since I am using Windows.)
Then I use a combination of tapply and lapply. The following function calculates the number of positive answers in the survey (coded as 1) for every calendar week (d$index = t[[23]]) and every combination of the two categorical columns t[[21]], t[[22]]. In the same step the whole thing is transformed into a data frame:
groupweeksums <- function(x,t){as.data.frame(tapply((x==1)*1,list(t[[23]],t[[21]],t[[22]]), function(d)sum(d,na.rm=TRUE)))}
Here x stands for the specific column and t for the data frame (I did not know how to do this otherwise, because at one point I have to address a column and at another the data frame, and I wanted to avoid lots of typing). If d is the data frame, then:
df <- groupweeksums(d[[1]],d)
So that I do not have to repeat this procedure for all of my 20 questions, I use lapply:
df <- as.data.frame(lapply(d[,1:20],function(x)groupweeksums(x,d)))
This gives me a beautiful data frame with everything I need for further analysis. Thanks for your help; with your helpful comments I came closer and closer to the solution!
P.S. I will also post this answer to the other question I posted on stackoverflow, which was connected to this one. I hope this is not a problem or against any rules here.
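A possible variant of this approach, sketched under assumptions: the helper count_ones below is hypothetical (it is not the groupweeksums function above), it addresses the grouping columns by name instead of by position, and it returns a long data frame with one row per week and category combination. The toy data and the week labels are made up; the labels merely follow the "yyyy-Www" style that ISOweek() returns.

```r
# Hypothetical helper: count answers coded 1 per week x land x nace,
# addressing the grouping columns by name rather than by position.
count_ones <- function(x, data) {
  aggregate(
    x   = list(n = x %in% 1),  # NA answers count as zero
    by  = list(week = data$week, land = data$land, nace = data$nace),
    FUN = sum
  )
}

# Toy data (made up); week labels are in the style ISOweek() returns.
toy <- data.frame(
  week     = c("2011-W07", "2011-W07", "2011-W07", "2011-W08"),
  land     = c("12", "12", "5", "12"),
  nace     = c("tra", "tra", "man", "tra"),
  rev_sit  = c(1, 1, 2, 1),
  prof_sit = c(2, 1, NA, 3),
  stringsAsFactors = FALSE
)

# Apply the helper to every question column and stack the results,
# tagging each block of rows with the question it came from.
questions <- c("rev_sit", "prof_sit")
result <- do.call(rbind, lapply(questions, function(q) {
  cbind(question = q, count_ones(toy[[q]], toy))
}))
print(result)
```

Keeping the question name as a column rather than as part of a column name means the number of columns stays fixed no matter how many questions or categories there are.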
I would add a "week" column, as you suggest, but convert the data to a tall format before processing; you can convert it back to a time series afterwards, if needed.
Equivalently, you could use ddply.
Your index column contains the number of the week in the current year (%U-%Y): this will only work if all the dates are within the same calendar year. It may be safer to use an actual date instead of the week number, for instance the Sunday at the start of the current week; this also makes it easier to turn the result into a time series.
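The code that accompanied this answer is not preserved here, so what follows is only a base R sketch of the idea, not the answerer's own code: it uses aggregate() as a stand-in for ddply, and the toy data and the names toy, tall, counts and n are made up. It computes the Sunday that starts each week as a real Date, reshapes two question columns into a tall layout, and counts rows per group.

```r
# Toy data (made up); the last date falls in a different calendar year.
toy <- data.frame(
  date     = as.Date(c("2011-02-14", "2011-02-15", "2011-02-20", "2012-01-02")),
  land     = c("12", "12", "5", "12"),
  nace     = c("tra", "tra", "man", "tra"),
  rev_sit  = c(1, 1, 1, 3),
  prof_sit = c(1, NA, 2, 3),
  stringsAsFactors = FALSE
)

# Sunday on or before each date: "%w" is the weekday number with Sunday = 0.
toy$week <- toy$date - as.integer(format(toy$date, "%w"))

# Tall format: one row per respondent and question.
questions <- c("rev_sit", "prof_sit")
tall <- data.frame(
  week     = rep(toy$week, times = length(questions)),
  land     = rep(toy$land, times = length(questions)),
  nace     = rep(toy$nace, times = length(questions)),
  question = rep(questions, each = nrow(toy)),
  answer   = unlist(toy[questions], use.names = FALSE),
  stringsAsFactors = FALSE
)
tall <- tall[!is.na(tall$answer), ]  # drop unanswered questions

# Count rows per week x land x nace x question x answer.
counts <- aggregate(
  x   = list(n = rep(1L, nrow(tall))),
  by  = tall[c("week", "land", "nace", "question", "answer")],
  FUN = sum
)
counts <- counts[order(counts$week), ]
print(counts)
```

Because week is an actual date rather than a week number, the rows sort chronologically even across the year boundary in the toy data.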