Getting the difference between column strings in an R data frame

Posted 2025-02-01 11:20:10


I have a basic question about R.

Suppose I have a data frame in which each column holds the set of nucleotide mutations for one of two samples, 'major' and 'minor':

library(tibble)

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

df <- data.frame(major, minor)
as_tibble(df)

#A tibble: 1 x 2
  major          minor               
  <chr>          <chr>               
1 T2A,C26T,G652A T2A,C26T,G652A,C725T

I want to identify the mutations present in 'minor' that are not in 'major'.

I know that if the 'major' and 'minor' mutations were stored as vectors, I could use setdiff to get this difference. However, the data I received stores each set as one long string with the mutations separated by commas, and I don't know how to turn these column strings into vectors inside the data frame so that I can take the difference (I tried without success).

Using setdiff directly on the columns:

setdiff(df$minor, df$major)
# I got
[1] "T2A C26T G652A C725T"

The expected result is:

C725T

Could anyone help me?

Best,


无声静候 2025-02-08 11:20:10

This works on a multi-row data frame, doing comparisons by row:

library(dplyr)
major <- c("T2A,C26T,G652A", "world")
minor <- c("T2A,C26T,G652A,C725T", "hello,world")

df <- data.frame(major,minor)

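df %>%
  mutate(
    # split each comma-separated string into a character vector (list column)
    across(c(major, minor), function(x) strsplit(x, split = ","))
  ) %>%
  mutate(
    # row-wise set difference; SIMPLIFY = FALSE always keeps a list column
    diff = mapply(setdiff, minor, major, SIMPLIFY = FALSE)
  )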
#              major                   minor  diff
# 1 T2A, C26T, G652A T2A, C26T, G652A, C725T C725T
# 2            world            hello, world hello

Note that it does modify the major and minor columns, turning them into list columns containing character vectors within each row. You can use the .names argument to across if you need to keep the originals.
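
As a rough sketch of that last point, the split vectors can be written to new columns so the original strings survive. The `_split` suffix and the `result` name are only illustrative choices, not part of the answer above:

```r
library(dplyr)

df <- data.frame(
  major = c("T2A,C26T,G652A", "world"),
  minor = c("T2A,C26T,G652A,C725T", "hello,world")
)

result <- df %>%
  mutate(
    # write the split vectors to major_split / minor_split (hypothetical names)
    across(
      c(major, minor),
      function(x) strsplit(x, split = ","),
      .names = "{.col}_split"
    )
  ) %>%
  mutate(
    # row-wise set difference, kept as a list column
    diff = mapply(setdiff, minor_split, major_split, SIMPLIFY = FALSE)
  )
```

Here `result$major` and `result$minor` still hold the original comma-separated strings, while `result$diff` holds the per-row differences.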


落日海湾 2025-02-08 11:20:10

The easiest way to do this is to define major and minor as character vectors:

major <- c("T2A", "C26T", "G652A")

and

minor <- c("T2A", "C26T", "G652A", "C725T")

then call setdiff on the vectors directly (they have different lengths, so they cannot be combined into a tibble as two columns):

setdiff(minor, major)
#> [1] "C725T"

If you cannot define major and minor as character vectors up front, you can use the stringr package to split the strings.

library(stringr)
library(tibble)

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

df <- tibble(
  major = str_split(major, pattern = ",", simplify = TRUE), 
  minor = str_split(minor, pattern = ",", simplify = TRUE)
)

setdiff(df$minor, df$major)
#> "C725T"
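
The snippet above handles a single row. For a data frame with several rows, a minimal base R sketch might look like the following; the two-row example data and the names `major_sets`, `minor_sets` and `only_in_minor` are assumptions made for illustration:

```r
df <- data.frame(
  major = c("T2A,C26T,G652A", "world"),
  minor = c("T2A,C26T,G652A,C725T", "hello,world"),
  stringsAsFactors = FALSE
)

# split every row's string into its own character vector
major_sets <- strsplit(df$major, ",", fixed = TRUE)
minor_sets <- strsplit(df$minor, ",", fixed = TRUE)

# pairwise set difference: elements of minor that are absent from major
only_in_minor <- Map(setdiff, minor_sets, major_sets)

# collapse each result back into one comma-separated string per row
df$diff <- vapply(only_in_minor, paste, character(1), collapse = ",")
```

Rows where 'minor' adds nothing to 'major' end up with an empty string in `df$diff`.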
