Working with data.frames in R (using SAS code to describe what I want)
I've been mostly working in SAS of late, but not wanting to lose what familiarity with R I have, I'd like to replicate something basic I've done. You'll forgive me if my SAS code isn't perfect; I'm doing this from memory since I don't have SAS at home.
In SAS I have a dataset that roughly is like the following example (. is equivalent of NA in SAS)
A B
1 1
1 3
0 .
0 1
1 0
0 0
If the dataset above was work.foo then I could do something like the following.
/* create work.bar from dataset work.foo */
data work.bar;
set work.foo;
/* generate a third variable and add it to work.bar */
if a = 0 and b ge 1 then c = 1;
if a = 0 and b = 0 then c = 2;
if a = 1 and b ge 1 then c = 3;
if a = 1 and b = 0 then c = 4;
run;
and I'd get something like
A B C
1 1 3
1 3 3
0 . .
0 1 1
1 0 4
0 0 2
And I could then proc sort by C and then perform various operations using C to create 4 subgroups. For example I could get the means of each group with
proc means noprint data =work.bar;
by c;
var a b;
output out = work.means mean(a b) = a b;
run;
and I'd get a dataset of the variables' means by group, called work.means, something like:
C A B
1 0 1
2 0 0
3 1 2
4 1 0
I think I may also get a . row, but I don't care about that for my purposes.
Now in R: I have the same data set, and it's been read in properly, but I have no idea how to add a variable to the end (like C) or how to do an operation on a subgroup (like the by c statement in proc means). Also, I should note that my variables aren't named in any sort of order, but according to what they represent.
I figure if somebody can show me how to do the above, I can generalize it to what I need to do.
Comments (2)
Assume your data set is a two-column dataframe called work.foo with variables a and b. Then the following code is one way to do it in R:
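A minimal base R sketch of what such code might look like follows. It is an illustration rather than the answerer's original code: the names work.bar and work.means are borrowed from the SAS example, and the sample data is retyped from the question. The which() calls are there so that rows where b is NA are skipped and keep c as NA, mirroring the SAS result.

```r
# Sample data from the question (NA plays the role of the SAS missing value ".")
work.foo <- data.frame(a = c(1, 1, 0, 0, 1, 0),
                       b = c(1, 3, NA, 1, 0, 0))

# Equivalent of the data step: copy the data frame and add a column c
work.bar <- work.foo
work.bar$c <- NA_real_
# which() drops NA comparisons, so rows with a missing b keep c = NA
work.bar$c[which(work.bar$a == 0 & work.bar$b >= 1)] <- 1
work.bar$c[which(work.bar$a == 0 & work.bar$b == 0)] <- 2
work.bar$c[which(work.bar$a == 1 & work.bar$b >= 1)] <- 3
work.bar$c[which(work.bar$a == 1 & work.bar$b == 0)] <- 4

# Equivalent of proc means with "by c": the mean of a and b within each group.
# The formula interface drops rows containing NA by default.
work.means <- aggregate(cbind(a, b) ~ c, data = work.bar, FUN = mean)
print(work.means)
#   c a b
# 1 1 0 1
# 2 2 0 0
# 3 3 1 2
# 4 4 1 0
```

No sorting step is needed here, since aggregate() groups the rows itself and returns an ordinary data frame.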
An alternative is to use ddply() from the plyr package - you wouldn't even have to create a group variable, necessarily (although that's awfully convenient).
Of course, if you had the grouping variable, you'd just replace c("a", "b") with "c".
The main advantage in my mind is that plyr functions will return whatever kind of object you like - ddply takes a data frame and gives you one back, dlply would return a list, etc. by() and its *apply brethren usually just give you a list. I think.
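As a rough sketch of the ddply() approach, assuming the plyr package is installed and reusing the sample data from the question (the column name n and the object names counts, work.bar, work.means and as_list are made up for this illustration):

```r
library(plyr)  # assumption: plyr is installed

work.foo <- data.frame(a = c(1, 1, 0, 0, 1, 0),
                       b = c(1, 3, NA, 1, 0, 0))

# Group directly on the combinations of a and b, without a helper variable.
# Here each group is simply counted.
counts <- ddply(work.foo, c("a", "b"), summarise, n = length(a))

# With an explicit grouping variable, pass its name instead of c("a", "b").
work.bar <- work.foo
work.bar$c <- with(work.bar,
                   ifelse(a == 0 & b >= 1, 1,
                   ifelse(a == 0 & b == 0, 2,
                   ifelse(a == 1 & b >= 1, 3,
                   ifelse(a == 1 & b == 0, 4, NA)))))
# Drop the rows where c is missing before summarising
work.bar <- work.bar[!is.na(work.bar$c), ]
work.means <- ddply(work.bar, "c", summarise, a = mean(a), b = mean(b))

# dlply() takes the same arguments but hands back a list, one element per group
as_list <- dlply(work.bar, "c", nrow)
```

Both counts and work.means come back as data frames, which is the convenience the answer points to.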