Assigning values based on the number of times a character is repeated
Sorry for the burst of questions. I am trying my best to search, but I have the arduous task of writing a very, very large program and I am still very new to R, so I appreciate all the quick help I have gotten so far.
Fake example to demonstrate the problem:
Gene <- c("A","B","C","A","B","C","A","B","C")
IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
# Build the data frame directly from the three vectors
ExampleData <- data.frame(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
ExampleData
Gene  IntensityValue  ProceedTest
A     1               2
B     10              2
C     20              2
A     3               2
B     NA              -1
C     23              2
A     NA              -1
B     NA              -1
C     22              2
ProceedTest is a score that indicates whether the test should proceed: a score of 2 means the test will take the data into account, and a score of -1 means it will not.
You'll notice that gene B has NA appear twice, while gene A has NA appear only once. I would like R to recognize that NA appears twice for gene B, so that whenever NA appears twice for a given gene, zero replaces each NA and the corresponding -1 is turned into a 2. I want R to ignore the NA for A and leave its ProceedTest values as they are.
The changed data should look like:
Gene  IntensityValue  ProceedTest
A     1               2
B     10              2
C     20              2
A     3               2
B     0               2
C     23              2
A     NA              -1
B     0               2
C     22              2
This may not be possible, but if it is, I would also like the ProceedTest value to become -1 whenever a gene has no NAs.
Final dataset:
Gene  IntensityValue  ProceedTest
A     1               2
B     10              2
C     20              -1
A     3               2
B     0               2
C     23              -1
A     NA              -1
B     0               2
C     22              -1
In summary: gene A has only one NA, so nothing changes. Gene B has two NA values, so all of its ProceedTest values become 2 and its NAs become zeros in the IntensityValue column. Gene C gets -1 because it does not contain any NA (its intensity values do not need to change).
I hope this is clear. I know my other questions have been a little easier, so I hope this one isn't so straightforward that I should have done more research to find the answer on my own.
Thanks for the help in advance,
Joe
Comments (2)
If you don't care about the order of your data.frame, ddply from the plyr package can do the trick. There are many other solutions, though.
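A minimal sketch of how a ddply-based approach might look (an illustration, not necessarily the code this answer had in mind). It splits ExampleData by Gene and applies the question's three rules to each piece. The helper name adjust_gene is hypothetical, and reading "NA appears twice" as "two or more NAs" is an assumption.

```r
library(plyr)

ExampleData <- data.frame(
  Gene           = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest    = c(2,2,2,2,-1,2,-1,-1,2)
)

# Hypothetical helper: receives the rows of one gene and returns them adjusted.
adjust_gene <- function(df) {
  n_na <- sum(is.na(df$IntensityValue))
  if (n_na >= 2) {
    # Assumption: "twice" is treated as "two or more" NAs.
    df$IntensityValue[is.na(df$IntensityValue)] <- 0
    df$ProceedTest <- 2
  } else if (n_na == 0) {
    # No NAs at all for this gene: the test should not proceed.
    df$ProceedTest <- -1
  }
  # Exactly one NA: rows are returned unchanged.
  df
}

# Split by Gene, apply the helper to each piece, and bind the pieces together.
Result <- ddply(ExampleData, .(Gene), adjust_gene)
Result
```

As the answer warns, the rows come back grouped by Gene (all A rows, then B, then C) rather than in their original order.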
One approach is to count the NAs per gene and merge the counts back in, with the caveat that there are almost certainly more efficient ways of doing this: if your data has many repeats for each gene, merging a very condensed data.frame of counts back onto the full data duplicates those counts on every row and will eat up a lot of memory.
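The count-and-merge idea described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the answerer's original code: the names NaCounts, NaCount, RowId and Merged are hypothetical, and "NA appears twice" is again read as "two or more NAs".

```r
ExampleData <- data.frame(
  Gene           = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest    = c(2,2,2,2,-1,2,-1,-1,2)
)

# Condensed data.frame: one row per gene with its number of missing intensities.
NaCounts <- aggregate(
  list(NaCount = is.na(ExampleData$IntensityValue)),
  by  = list(Gene = ExampleData$Gene),
  FUN = sum
)

# merge() reorders rows, so keep a row id to restore the original order afterwards.
ExampleData$RowId <- seq_len(nrow(ExampleData))
Merged <- merge(ExampleData, NaCounts, by = "Gene")
Merged <- Merged[order(Merged$RowId), ]

# Genes with two or more NAs (assumption): NAs become 0 and every ProceedTest becomes 2.
many_na <- Merged$NaCount >= 2
Merged$IntensityValue[many_na & is.na(Merged$IntensityValue)] <- 0
Merged$ProceedTest[many_na] <- 2

# Genes with no NAs at all: ProceedTest becomes -1.
Merged$ProceedTest[Merged$NaCount == 0] <- -1

# Drop the helper columns and reset the row names.
Result <- Merged[, c("Gene", "IntensityValue", "ProceedTest")]
rownames(Result) <- NULL
Result
```

The NaCount column repeats each gene's count on every one of its rows, which is the memory cost the caveat refers to.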