Assigning values based on how many times a value repeats

Posted 2024-11-18 05:39:09

Sorry for the burst of questions. I'm trying my best to search, but I have the arduous task of writing a very, very large program and I am still very new to R, so I appreciate all the quick help I have received so far.

A made-up example to demonstrate the problem:

Gene <- c("A","B","C","A","B","C","A","B","C")
IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
ExampleData <- as.data.frame(ExampleData)
ExampleData
Gene IntensityValue ProceedTest
 A              1           2
 B             10           2
 C             20           2
 A              3           2
 B             NA          -1
 C             23           2
 A             NA          -1
 B             NA          -1
 C             22           2

ProceedTest is a score that indicates whether the test should proceed. A score of 2 means the test will take the data into account; a score of -1 means it will not.

You'll notice that gene B has NA twice, while gene A has NA only once. I would like R to recognize that NA appears twice for gene B, so that whenever NA appears twice for a given gene (here B), a zero replaces each NA and the corresponding -1 is turned into a 2. I want R to ignore the NA for A and leave its ProceedTest values as they are.

The changed data should look like:

Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20           2
  A              3           2
  B              0           2
  C             23           2
  A             NA          -1
  B              0           2
  C             22           2

This may not be possible, but if it is, I would also like the ProceedTest value to become -1 when a gene has no NAs at all.

Final dataset:
 Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20          -1
  A              3           2
  B              0           2
  C             23          -1
  A             NA          -1
  B              0           2
  C             22          -1

In summary: gene A has only one NA, so nothing changes. Gene B has two NA values, so it gets all 2's and its NAs become zeros in the IntensityValue column. Gene C gets -1 because it does not contain any NA (its intensity values don't need to change).

I hope this is clear. I know my other questions have been a little easier, so I hope this one isn't so straightforward that I should have done more research to find the answer on my own.

Thanks for the help in advance,

Joe

你的笑 2024-11-25 05:39:10

If you don't care about the row order of your data.frame (the result comes back grouped by Gene), ddply from the plyr package can do the trick:

library(plyr)  # provides ddply

ddply(ExampleData, "Gene", function(dfr) {
    # dfr is the subset of the original data.frame
    # for the current value of Gene
    isNA <- is.na(dfr$IntensityValue)
    numNA <- sum(isNA)
    if (numNA > 1) {
        # replace only the missing intensities, keeping observed values
        dfr$IntensityValue[isNA] <- 0
        dfr$ProceedTest <- 2
    } else if (numNA == 0) {
        dfr$ProceedTest <- -1
    }
    dfr
})

There are many other solutions though.
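
One such alternative, as a minimal sketch in base R: `ave()` returns a per-row NA count for each row's gene, so the original row order is kept. The names `na_count`, `fill`, and `Result` are made up for this illustration, and the rules are the ones stated in the question.

```r
ExampleData <- data.frame(
  Gene           = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest    = c(2,2,2,2,-1,2,-1,-1,2)
)

# For every row, the number of NA intensities recorded for that row's gene.
na_count <- ave(as.numeric(is.na(ExampleData$IntensityValue)),
                ExampleData$Gene, FUN = sum)

Result <- ExampleData

# Genes with more than one NA: zero the missing values, let the test proceed.
fill <- na_count > 1 & is.na(Result$IntensityValue)
Result$IntensityValue[fill] <- 0
Result$ProceedTest[na_count > 1] <- 2

# Genes with no NA at all: skip the test.
Result$ProceedTest[na_count == 0] <- -1

Result
```

Genes with exactly one NA (A here) match neither condition, so their rows are left untouched.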

多像笑话 2024-11-25 05:39:10

Here is one way, with the caveat that there are almost certainly more efficient approaches (if your data has many repeats for each gene, merging the small per-gene count table back onto every row duplicates the counts and will eat up a lot of memory):

Gene <- c("A","B","C","A","B","C","A","B","C")
IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
ExampleData <- as.data.frame(ExampleData)
ExampleData

# Count the NAs in each per-gene subset (only IntensityValue contains NAs here)
num.na <- function(x) {
  sum(is.na(x))
}
ED.numna <- by(data=ExampleData, Gene, num.na)

# Convert a 1-d "by" result into a two-column data.frame.
# res.name is what you want the result column to be named;
# ideally this would be pulled from the call via something like as.character(attr(x,"call"))
as.data.frame.by <- function(x, res.name=NA) {
  stopifnot(length(dimnames(x))==1) # Only 1d case handled for now
  df <- data.frame(by = names(x), res = as.numeric(x))
  names(df)[names(df)=="by"] <- names(dimnames(x))
  if(!is.na(res.name)) {
    names(df)[names(df)=="res"] <- res.name
  }
  df
}

# merge() joins on the shared Gene column (and sorts the rows by Gene)
ExampleData <- merge(ExampleData, as.data.frame(ED.numna, "count"))

# Genes with more than one NA: zero only the missing intensities and let the test proceed
multi <- ExampleData$count > 1
ExampleData$IntensityValue[multi & is.na(ExampleData$IntensityValue)] <- 0
ExampleData$ProceedTest[multi] <- 2
# Genes with no NA at all: skip the test
ExampleData$ProceedTest[ExampleData$count == 0] <- -1
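
If the memory cost of the merge is a concern, a possible lighter variant is to keep the counts as a small named vector and look them up by gene name instead of joining them on. This is only a sketch; `na_per_gene` and `row_count` are illustrative names, not part of the code above.

```r
ExampleData <- data.frame(
  Gene           = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest    = c(2,2,2,2,-1,2,-1,-1,2)
)

# One NA count per gene, named by gene: A = 1, B = 2, C = 0.
na_per_gene <- tapply(is.na(ExampleData$IntensityValue),
                      ExampleData$Gene, sum)

# Expand to one count per row by indexing with the gene name (no merge).
row_count <- as.vector(na_per_gene[as.character(ExampleData$Gene)])

# Same rules as in the question, applied without reordering the rows.
ExampleData$IntensityValue[row_count > 1 & is.na(ExampleData$IntensityValue)] <- 0
ExampleData$ProceedTest[row_count > 1] <- 2
ExampleData$ProceedTest[row_count == 0] <- -1
```

The lookup never adds a count column to the data, so the only extra storage is the per-gene vector and one temporary vector of row counts.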
