Assign values based on the number of times a character is repeated

Posted on 2024-11-18 05:39:09

Sorry for the burst of question after question. Trying my best to search, but I have the arduous task of coming up with a very, very large program and I am still very new to R so I appreciate all the quick help I have got thus far.

Fake example to demonstrate the problem:

> Gene <- c("A","B","C","A","B","C","A","B","C")
> IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
> ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
> ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
> ExampleData <- as.data.frame(ExampleData)
> ExampleData
Gene IntensityValue ProceedTest
 A              1           2
 B             10           2
 C             20           2
 A              3           2
 B             NA          -1
 C             23           2
 A             NA          -1
 B             NA          -1
 C             22           2

ProceedTest is a score that indicates whether the test should proceed. A score of 2 means it will take the data into account, a score of -1 means that the test will not take the data into account.

You'll notice that gene B has NA appear twice, while gene A has NA appear only once. I would like R to recognize that NA appears twice for gene B, so that any time NA appears twice for a given gene, a value of zero replaces each NA and the corresponding -1 in ProceedTest is turned into a 2. I want R to ignore the NA for A and leave its ProceedTest values as they are.

The changed data should look like:

Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20           2
  A              3           2
  B              0           2
  C             23           2
  A             NA          -1
  B              0           2
  C             22           2

This may not be possible, but if it is, I would like to be able to say that if there are no NA's for the gene then the ProceedTest value becomes a -1.

Final Dataset
 Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20          -1
  A              3           2
  B              0           2
  C             23          -1
  A             NA          -1
  B              0           2
  C             22          -1

In summary: gene A has only one NA, so nothing changes. Gene B has two NA values, so its ProceedTest values all become 2 and its NAs become zeros in the IntensityValue column. Gene C gets a ProceedTest of -1 because it does not contain any NA (its intensity values do not need to change).

I hope this is clear. My other questions have been a little easier, so I hope this one isn't so straightforward that I should have found the answer on my own with more research.

Thanks for the help in advance,

Joe

Comments (2)

你的笑 2024-11-25 05:39:10

If you don't care about the order of your data.frame, ddply from the plyr package can do the trick:

library(plyr)

ddply(ExampleData, "Gene", function(dfr) {
    # dfr is the subset of the original data.frame
    # for the current value of Gene
    numNA <- sum(is.na(dfr$IntensityValue))
    if (numNA > 1) {
        # replace only the missing intensities; measured values are kept
        dfr$IntensityValue[is.na(dfr$IntensityValue)] <- 0
        dfr$ProceedTest <- 2
    } else if (numNA == 0) {
        dfr$ProceedTest <- -1
    }
    dfr
})

There are many other solutions though.
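
If the original row order does matter, one such alternative is base R's ave(), which computes a per-gene NA count aligned with the original rows. This is only a sketch: the names na_count, fill and result are made up for illustration, and "appears twice" is read here as "two or more NAs".

```r
# Rebuild the example data from the question.
ExampleData <- data.frame(
  Gene = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
  IntensityValue = c(1, 10, 20, 3, NA, 23, NA, NA, 22),
  ProceedTest = c(2, 2, 2, 2, -1, 2, -1, -1, 2)
)

# For every row, the number of NAs found in that row's gene.
# ave() returns a vector in the original row order.
na_count <- ave(as.numeric(is.na(ExampleData$IntensityValue)),
                ExampleData$Gene, FUN = sum)

result <- ExampleData
# Genes with two or more NAs: fill only the missing intensities, set the score to 2.
fill <- na_count >= 2 & is.na(result$IntensityValue)
result$IntensityValue[fill] <- 0
result$ProceedTest[na_count >= 2] <- 2
# Genes with no NA at all: set the score to -1.
result$ProceedTest[na_count == 0] <- -1
result
```

Genes with exactly one NA (A in the example) match neither condition, so their rows are left untouched.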

多像笑话 2024-11-25 05:39:10

Here is one approach using by() and merge(), with the caveat that there are almost certainly more efficient ways of doing this (if your data has many repeats for each gene, merging the small per-gene count table back into the full data.frame duplicates each count on every row, which will eat up a lot of memory):

Gene <- c("A","B","C","A","B","C","A","B","C")
IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
ExampleData <- as.data.frame(ExampleData)
ExampleData

num.na <- function(x) {
  sum(is.na(x))
}
ED.numna <- by(data=ExampleData,Gene,num.na)
# res.name is what you want the result column to be named
  #ideally would pull this from the call via something like as.character(attr(x,"call"))
as.data.frame.by <- function(x,res.name=NA) {
  stopifnot(length(dimnames(x))==1) # Only 1d case handled for now
  df <- data.frame(by = names(x), res = as.numeric(x) )
  names(df)[names(df)=="by"] <- names(dimnames(x))
  if(!is.na(res.name)) {
    names(df)[names(df)=="res"] <- res.name
  }
  df
}
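# merge() joins on the shared Gene column (and sorts the rows by Gene)
ExampleData <- merge(ExampleData, as.data.frame(ED.numna, res.name = "count"))
# Zero out only the missing intensities for genes with more than one NA
fill <- ExampleData$count > 1 & is.na(ExampleData$IntensityValue)
ExampleData$IntensityValue[fill] <- 0
ExampleData$ProceedTest[ExampleData$count > 1] <- 2
# Genes with no NA at all are excluded from the test
ExampleData$ProceedTest[ExampleData$count == 0] <- -1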
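
The merge step can probably be avoided altogether by treating the per-gene counts as a lookup table indexed by gene name. The sketch below assumes the same example data; counts, na_per_row and out are names invented for this illustration, not part of the answer above.

```r
# Rebuild the example data from the question.
ExampleData <- data.frame(
  Gene = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
  IntensityValue = c(1, 10, 20, 3, NA, 23, NA, NA, 22),
  ProceedTest = c(2, 2, 2, 2, -1, 2, -1, -1, 2)
)

# One NA count per gene, named by gene.
counts <- tapply(is.na(ExampleData$IntensityValue), ExampleData$Gene, sum)

# Look up each row's gene in the count table; no merged data.frame is built.
na_per_row <- as.vector(counts[as.character(ExampleData$Gene)])

out <- ExampleData
out$IntensityValue[na_per_row > 1 & is.na(out$IntensityValue)] <- 0
out$ProceedTest[na_per_row > 1] <- 2
out$ProceedTest[na_per_row == 0] <- -1
out
```

Because the lookup is done by indexing rather than joining, the rows stay in their original order.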