Selecting between duplicates in a data frame
Earlier I asked a question about extracting duplicate lines from a data frame. I now need to run a script to decide which of these duplicates to keep in my final data set.
Duplicate entries in this data set have the same 'Assay' and 'Sample' values. Here are the first 10 lines of the new data set I'm working with, containing my duplicate entries:
Assay Sample Genotype Data
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
3 CCT6-002 1997 G 0
4 CCT6-002 1997 NA NA
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
9 CCT6-015 0121 G 0
10 CCT6-015 0121 NA NA
I'd like to run a script that will break these duplicate samples into 4 bins based on the value of 'Data', which can be 1, 0, or NA:
1) All values for 'Data' are NA
2) All values for 'Data' are identical, no NA
3) At least 1 value for 'Data' is not identical, no NA.
4) At least 1 value for 'Data' is not identical, at least one is NA.
The expected result from the above data would look like this:
Set 1
Null
Set 2
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
Set 3
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
Set 4
3 CCT6-002 1997 G 0
4 CCT6-002 1997 NA NA
9 CCT6-015 0121 G 0
10 CCT6-015 0121 NA NA
There are cases in which more than 2 "duplicate" data points exist in this data set. I'm not even sure where to start with this, as I'm a newbie to R.
EDIT: With expected data.
Comments (2)
You have asked a question that veers in the direction of asking others to do your entire work for you. A question about a single, specific piece of this project would probably be more likely to attract a response. The piece you are struggling with that is preventing you from starting is a very basic programming skill: the ability to break your problem down into small concrete steps, solve each one individually and then put them together again to solve your original problem.
That skill is also very hard to learn, though. But you have a good start! You have nicely specified the four groups your data can fall into:
1) All values for 'Data' are NA
2) All values for 'Data' are identical, no NA
3) At least 1 value for 'Data' is not identical, no NA.
4) At least 1 value for 'Data' is not identical, at least one is NA.
Now think about this: given just one subset of your data, how would you determine in R which group (1-4) it belongs to? The following is a sketch of some tools that might be useful for doing this. Build a few subsets and play around in the console until you feel comfortable identifying each group:
(1) Are all values for datSub$Data NAs? Tools: all and is.na
(2) Only one unique value, not NA? Tools: length, unique, is.na, any
(3) More than one unique value, no NAs? Tools: length, unique, any, is.na
(4) More than one unique value, at least one NA? Tools: length, unique, any, is.na
It may be possible to do this without using all these functions, but they are all potentially useful.
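As a rough illustration of the kind of console experiments meant here, the checks could look something like the following. The four small vectors are made up for the example and simply stand in for datSub$Data from different subsets.

```r
# Hypothetical stand-ins for datSub$Data in four different subsets
x_all_na <- c(NA, NA)
x_same   <- c(0, 0)
x_diff   <- c(1, 0)
x_mixed  <- c(0, NA)

# (1) every value is NA
all(is.na(x_all_na))

# (2) exactly one unique value and no NA
length(unique(x_same)) == 1 && !any(is.na(x_same))

# (3) more than one unique value and no NA
length(unique(x_diff)) > 1 && !any(is.na(x_diff))

# (4) some, but not all, values are NA
# Note that unique() keeps NA as a value of its own, so
# length(unique(x_mixed)) is 2 here.
any(is.na(x_mixed)) && !all(is.na(x_mixed))
```

Each of the four expressions evaluates to TRUE for the vector it is applied to, and trying them on the other vectors shows how the cases separate.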
Once you know how to determine which group a particular subset should be in, you are ready to wrap that code into a function. My suggestion would be to create a new column with the value 1-4 depending on which group that subset falls in.
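A minimal sketch of such a function might look like this. The name assign_grp is made up for illustration, and the sketch assumes that any subset containing some, but not all, NA values belongs in group 4, which matches the expected output in the question.

```r
# Hypothetical helper (name is illustrative): takes one subset of the data
# and returns it with a new column grp holding the group number 1-4.
assign_grp <- function(datSub) {
  x <- datSub$Data
  if (all(is.na(x))) {
    grp <- 1L            # all values are NA
  } else if (any(is.na(x))) {
    grp <- 4L            # assumption: some, but not all, values are NA
  } else if (length(unique(x)) == 1) {
    grp <- 2L            # all values identical, no NA
  } else {
    grp <- 3L            # values differ, no NA
  }
  datSub$grp <- grp      # the single value is recycled to every row
  datSub
}

# Try it on one subset taken from the question's data
datSub <- data.frame(Assay = "CCT6-002", Sample = "1997",
                     Genotype = c("G", NA), Data = c(0, NA))
assign_grp(datSub)
```

Checking for NA first keeps the later branches simple, because by the time unique is called the vector is known to contain no NA.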
Then use ddply to apply this function to each subset of your data based on the values of Sample, and finally split this data frame on its new grp variable.
Hopefully, this general sketch helps to get you started. But you will have problems. Everyone does. If you run into specific problems along the way, feel free to ask another question about that.
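The ddply and split steps described above could be sketched roughly as follows. This assumes the plyr package is installed. The data frame dat is typed in from the ten rows in the question, and assign_grp is a hypothetical helper written for this example. The grouping uses both Assay and Sample, since the question defines duplicates by both columns.

```r
library(plyr)

# The ten example rows from the question, typed in by hand
dat <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-015"), times = c(6, 4)),
  Sample   = rep(c("1486", "1997", "0050", "0082", "0121"), each = 2),
  Genotype = c("A", "G", "G", NA, "G", "G", "G", "T", "G", NA),
  Data     = c(1, 0, 0, NA, 0, 0, 0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label one subset with its group number 1-4
assign_grp <- function(datSub) {
  x <- datSub$Data
  datSub$grp <- if (all(is.na(x))) 1L
                else if (any(is.na(x))) 4L
                else if (length(unique(x)) == 1) 2L
                else 3L
  datSub
}

# Apply the helper to every Assay/Sample combination
datGrp <- ddply(dat, .(Assay, Sample), assign_grp)

# One data frame per group that actually occurs in the data
sets <- split(datGrp, datGrp$grp)
names(sets)
```

With the example rows, sets has no element for group 1 because no Assay/Sample pair has only NA values, which corresponds to the "Null" shown for Set 1 in the question.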
Indeed, I see now that John has posted an answer along the lines of my sketch. However, I will post this answer anyway in the hopes that it helps you to analyze future problems.
This should be a good start. Depending on how long your dataset is, it may or may not be worth it to optimize this for better speed.
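The code that accompanied this answer is not reproduced here. As a stand-in, one possible base R approach is sketched below. It is not necessarily the answerer's method, and the names classify, key and sets are made up for the example.

```r
# The ten example rows from the question, typed in by hand
dat <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-015"), times = c(6, 4)),
  Sample   = rep(c("1486", "1997", "0050", "0082", "0121"), each = 2),
  Genotype = c("A", "G", "G", NA, "G", "G", "G", "T", "G", NA),
  Data     = c(1, 0, 0, NA, 0, 0, 0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map one group's Data values to a bin number 1-4
classify <- function(x) {
  if (all(is.na(x))) 1
  else if (any(is.na(x))) 4
  else if (length(unique(x)) == 1) 2
  else 3
}

# Duplicates share both Assay and Sample, so build one key from the two
key <- paste(dat$Assay, dat$Sample, sep = "_")

# ave() applies classify within each key and fills the result into every row
dat$grp <- ave(dat$Data, key, FUN = classify)

# Fixing the factor levels keeps an empty data frame for bins with no rows
sets <- split(dat, factor(dat$grp, levels = 1:4))
sapply(sets, nrow)
```

Here ave and split do all the grouping work, so no extra packages are needed. For a very long data set it may be worth timing this against other approaches.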