Choosing between duplicates in a data frame

Posted 2024-12-11 05:50:54

Earlier I asked a question about extracting duplicate lines from a data frame. I now need to run a script to decide which of these duplicates to keep in my final data set.

Duplicate entries in this data set have the same 'Assay' and 'Sample' values. Here are the first 10 lines of the new data set I'm working with, which contains my duplicate entries:

     Assay   Sample    Genotype   Data
1  CCT6-002   1486         A        1
2  CCT6-002   1486         G        0
3  CCT6-002   1997         G        0
4  CCT6-002   1997         NA       NA
5  CCT6-002   0050         G        0
6  CCT6-002   0050         G        0
7  CCT6-015   0082         G        0
8  CCT6-015   0082         T        1
9  CCT6-015   0121         G        0
10 CCT6-015   0121         NA       NA

I'd like to run a script that will break these duplicate samples into 4 bins based on the value of 'Data', which can be 1, 0, or NA:

 1) All values for 'Data' are NA
 2) All values for 'Data' are identical, no NA
 3) At least 1 value for 'Data' is not identical, no NA.
 4) At least 1 value for 'Data' is not identical, at least one is NA.

The expected result from the above data would look like this:

Set 1
Null

Set 2
5  CCT6-002   0050         G        0
6  CCT6-002   0050         G        0

Set 3
1  CCT6-002   1486         A        1
2  CCT6-002   1486         G        0
7  CCT6-015   0082         G        0
8  CCT6-015   0082         T        1

Set 4
3  CCT6-002   1997         G        0
4  CCT6-002   1997         NA       NA
9  CCT6-015   0121         G        0
10 CCT6-015   0121         NA       NA

There are cases in which more than 2 "duplicate" data points exist in this data set. I'm not even sure where to start with this, as I'm a newbie to R.

EDIT: With expected data.

笑看君怀她人 2024-12-18 05:50:54

You have asked a question that veers in the direction of asking others to do your entire work for you. A question about a single, specific piece of this project would probably be more likely to attract a response. The piece you are struggling with that is preventing you from starting is a very basic programming skill: the ability to break your problem down into small concrete steps, solve each one individually and then put them together again to solve your original problem.

That skill is also very hard to learn, though. But you have a good start! You have nicely specified the four groups your data can fall into:

 1) All values for 'Data' are NA
 2) All values for 'Data' are identical, no NA
 3) At least 1 value for 'Data' is not identical, no NA.
 4) At least 1 value for 'Data' is not identical, at least one is NA.

Now think about this: given just one subset of your data, how would you determine in R which group (1-4) it belongs to? The following is a sketch of some tools that might be useful for doing this. Build a few subsets and play around in the console until you feel comfortable identifying each group:

(1) Are all values for datSub$Data NAs?

Tools: all and is.na

(2) Only one unique value, not NA?

Tools: length, unique, is.na, any

(3) More than one unique value, no NAs?

Tools: length, unique, any, is.na

(4) More than one unique value, at least one NA?

Tools: length, unique, any, is.na

It may be possible to do this without using all these functions, but they are all potentially useful.
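As a rough illustration of the four checks above, here is a minimal console sketch in base R. The four vectors are made-up stand-ins for `datSub$Data` (they are assumptions for illustration, not taken from the real data set), and the exact conditions are only one possible way to combine the listed tools.

```r
# Hypothetical Data vectors, one for each of the four groups
all_na   <- c(NA, NA)
same     <- c(0, 0)
differ   <- c(1, 0)
mixed_na <- c(0, NA)

# (1) every value is NA
is_group1 <- all(is.na(all_na))

# (2) exactly one unique value and no NA
is_group2 <- length(unique(same)) == 1 && !any(is.na(same))

# (3) more than one unique value and no NA
is_group3 <- length(unique(differ)) > 1 && !any(is.na(differ))

# (4) more than one unique value, at least one of them NA
#     (unique() counts NA as a value of its own)
is_group4 <- length(unique(mixed_na)) > 1 && any(is.na(mixed_na))

print(c(is_group1, is_group2, is_group3, is_group4))
```

Each check evaluates to TRUE for its own vector. Swapping the vectors around in the console is a quick way to see which conditions overlap and in what order they need to be tested.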

Once you know how to determine which group a particular subset should be in, you are ready to wrap that code into a function. My suggestion would be to create a new column with the value 1-4 depending on which group that subset falls in:

myFun <- function(x){
    if (...){
        x$grp <- 1
    }
    if (...){
        x$grp <- 2
    }
    #etc.

    return(x)
}

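Then use ddply (from the plyr package) to apply this function to each subset of your data, grouped by the values of Assay and Sample, which together identify a set of duplicates. Assign the result back so that the new column is kept:

dat <- ddply(dat, .(Assay, Sample), .fun = myFun)

And finally split this data frame on its new grp variable:

split(dat, dat$grp)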
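For readers who want to try the whole outline without installing plyr, the following is one possible base R version of the same three steps (label each subset, recombine, split). The conditions filled into `myFun` are an assumption about what the elided `if (...)` tests could look like, and the toy `dat` only reuses the first six rows from the question.

```r
# Toy data: the first six rows of the question's example
dat <- data.frame(
  Assay    = rep("CCT6-002", 6),
  Sample   = c("1486", "1486", "1997", "1997", "0050", "0050"),
  Genotype = c("A", "G", "G", NA, "G", "G"),
  Data     = c(1, 0, 0, NA, 0, 0),
  stringsAsFactors = FALSE
)

# Label one subset with its group number (1-4).
# The tests below are assumed; the original sketch leaves them open.
myFun <- function(x) {
  d <- x$Data
  if (all(is.na(d))) {
    x$grp <- 1
  } else if (any(is.na(d))) {
    x$grp <- 4
  } else if (length(unique(d)) == 1) {
    x$grp <- 2
  } else {
    x$grp <- 3
  }
  x
}

# Split by Assay and Sample, label each piece, and stack the pieces again.
# Rows come back grouped by key, not in their original order.
pieces   <- split(dat, list(dat$Assay, dat$Sample), drop = TRUE)
labelled <- do.call(rbind, lapply(pieces, myFun))
rownames(labelled) <- NULL

# One data frame per group that actually occurs in the data
groups <- split(labelled, labelled$grp)
```

Note that `split` only returns groups that occur, so a group with no members (group 1 in this toy data) is simply absent from `groups` rather than present as an empty data frame.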

Hopefully, this general sketch helps to get you started. But you will have problems. Everyone does. If you run into specific problems along the way, feel free to ask another question about that.

Indeed, I see now that John has posted an answer along the lines of my sketch. However, I will post this answer anyway in the hopes that it helps you to analyze future problems.

难以启齿的温柔 2024-12-18 05:50:54

This should be a good start. Depending on how long your dataset is, it may or may not be worth it to optimize this for better speed.

require(plyr)

# Read data
data = read.table('data.txt', colClasses=c(NA, NA, 'character', NA, NA))

# Function to pick set
pickSet <- function(x) {
  if(all(is.na(x$Data))) {
    set = 1
  } else if(length(unique(x$Data)) == 1) {
    set = 2
  } else if(!any(is.na(x$Data))) {
    set = 3
  } else {
    set = 4
  }
  data.frame(Set=set)
}

# Identify Set for each combo of Assay and Sample
sets = ddply(data, c('Assay', 'Sample'), pickSet)

# Merge set info back with data (join explicitly on the grouping columns)
data = join(data, sets, by = c('Assay', 'Sample'))

# Reformat to list
sets.list = lapply(1:4, function(x) data[data$Set==x,-5])

> sets.list
[[1]]
[1] Assay    Sample   Genotype Data    
<0 rows> (or 0-length row.names)

[[2]]
     Assay Sample Genotype Data
5 CCT6-002   0050        G    0
6 CCT6-002   0050        G    0

[[3]]
     Assay Sample Genotype Data
1 CCT6-002   1486        A    1
2 CCT6-002   1486        G    0
7 CCT6-015   0082        G    0
8 CCT6-015   0082        T    1

[[4]]
      Assay Sample Genotype Data
3  CCT6-002   1997        G    0
4  CCT6-002   1997     <NA>   NA
9  CCT6-015   0121        G    0
10 CCT6-015   0121     <NA>   NA
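If speed does become a concern, one possible direction is to avoid calling a function once per group and instead count values per group with `rowsum`. The sketch below is an assumption-laden alternative, not the approach used above: it relies on 'Data' only ever being 0, 1, or NA (as stated in the question), rebuilds the example data inline instead of reading `data.txt`, and uses a hypothetical `key` built by pasting Assay and Sample together.

```r
# Example data from the question, built inline
data <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-015"), times = c(6, 4)),
  Sample   = c("1486", "1486", "1997", "1997", "0050", "0050",
               "0082", "0082", "0121", "0121"),
  Genotype = c("A", "G", "G", NA, "G", "G", "G", "T", "G", NA),
  Data     = c(1, 0, 0, NA, 0, 0, 0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# One key per Assay/Sample combination
key   <- paste(data$Assay, data$Sample, sep = "|")
is_na <- is.na(data$Data)

# Per-group counts, expanded back to one value per row by indexing with key
n_all  <- rowsum(rep(1L, nrow(data)), key)[key, 1]
n_na   <- rowsum(as.integer(is_na), key)[key, 1]
n_one  <- rowsum(as.integer(!is_na & data$Data == 1), key)[key, 1]
n_zero <- rowsum(as.integer(!is_na & data$Data == 0), key)[key, 1]

# Same four sets as pickSet, derived from the counts
data$Set <- unname(
  ifelse(n_na == n_all, 1L,
  ifelse(n_na > 0, 4L,
  ifelse(n_one > 0 & n_zero > 0, 3L, 2L)))
)

# Reformat to a list, dropping the helper column
sets.list <- lapply(1:4, function(s) data[data$Set == s, names(data) != "Set"])
```

This version also handles groups with more than two duplicates, since it only looks at counts. Whether it is actually faster than the `ddply` version depends on the size of the data set and is worth measuring before switching.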
