How can I change certain NA values to a specified string of my choosing in R?
For part of an internal R&D project I'm working on at work, I need to efficiently and programmatically change certain `NA` values to the string `BMNDITS` (which stands for "Biomarker Not Detected in this Set"). For context, I work at a small biotech company whose service is scanning for small biomarkers present in various sample types from experiments run by clients (each experiment has a unique sample set ID associated with it). Clients send us the samples, we scan the data for the various biomarkers, and then we return a heatmap and the actual data itself.
Oftentimes, clients run multiple experiments over time so they can eventually acquire enough relevant data. If they collect enough samples from their various populations of interest, they'll want us to merge and stack the data so it is all stored in one nice, finalized, merged data frame. Sounds easy enough, right? Well, the issue is that because not all biomarkers are present in every study, a lot of `NA`s get introduced. It's true that in any given study, one individual may have a biomarker present while another won't have it detected in their donated sample, so for that particular individual and that particular biomarker there will just be a single `NA` entry (though sometimes a couple may occur in a row). That's fine, because obviously we can't control when a biomarker will be present in a given individual since it's completely random.
The problem, though, is that when we stack the data sets on top of each other to create this final merged data frame, a biomarker that was not observed at all in a given population/sample set ID currently shows up as a long run of sequential `NA` values in that column. This isn't very descriptive, in my opinion, so I'm trying to create an R function that will go in and change those values from a regular old `NA` to `BMNDITS`. That way, when the researchers are looking at the actual data and want to do things with it, they can distinguish true missing values from values that don't exist solely because the biomarker wasn't observed for that given population.
So, I've created some fake data I'm using to simulate data that we might get from three separate experiments (which are stored in the three "toy" data frames I've created in the code provided below). If you run what I've created below, it will result in one "all" data frame at the end that consists of 30 observations from 30 (fake) individuals, where each biomarker is a column labeled "x1", "x2", etc. Again, since the point here is to try and simulate real data, I've made it so that sometimes a biomarker is present in one set and not all the others. This is why the column names aren't all the same and some have names that aren't present in the others.
# loading dplyr
library(dplyr)
# making a couple toy data frames
set.seed(42)
toy_df1 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df2 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df3 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
# assigning the names of the various "biomarkers" for this fake data
names(toy_df1) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
names(toy_df2) <- c("x1", "x2", "x3", "x5", "x6", "x7", "x8", "x9", "x10", "x11")
names(toy_df3) <- c("x1", "x3", "x4", "x5", "x7", "x8", "x9", "x10", "x11", "x13")
# adding a dummy SSID to each toy dataframe
toy_df1$SSID <- as.numeric(rep(24001, nrow(toy_df1))) # Sample set ID from the first study
toy_df2$SSID <- as.numeric(rep(24002, nrow(toy_df2))) # Sample set ID from the second study
toy_df3$SSID <- as.numeric(rep(24003, nrow(toy_df3))) # Sample set ID from the third study
# Creating the NA insertion/MakeNA() function I'll need
# to help simulate the randomness that the NA values have
# regarding where they exist in the data
NA_Insert_Inator <- function(x) {
  x %>% mutate(
    across(
      starts_with("x"),
      function(.x, probMiss) {
        ifelse(runif(nrow(.)) < probMiss, NA, .x)
      },
      probMiss = 0.1
    )
  )
}
# Using the above function to randomly replace values in each toy dataframe with NA
toy_df1 <- NA_Insert_Inator(toy_df1)
toy_df2 <- NA_Insert_Inator(toy_df2)
toy_df3 <- NA_Insert_Inator(toy_df3)
# merging the toy data sheets into the "Data All"-esque file;
# this takes each dataframe and stacks
# them on top of each other in sequential order of the SSIDs.
# (Also, lastly I move the SSID columns to be the last columns in the toy_data_all dataframe)
toy_data_all <- bind_rows(toy_df1, toy_df2, toy_df3)
toy_data_all <- toy_data_all %>% select(-SSID, SSID)
So if you run the above code you should end up getting something that looks similar to this:
I've created the following R function to try and change these long streaks of `NA` values, but I can't get it to work. I can define the function fine, but when I try to apply it to my `toy_data_all` data frame I just get a value of `NULL` in return. What I was expecting was that those long streaks of `NA` values (specifically 10 in a row, since that's the number of fake participants in each study) would be changed to the specified string `BMNDITS`.
The way I have tried going about it is based on using the SSID of each individual data frame. Specifically, if, for any given column, the values for a specific SSID are all `NA`, change them to say `BMNDITS`. I'm not sure what's going wrong here, and perhaps there is a better and more efficient way of going about this. Attempt here:
BMNDITS_Inator <- function(freshly_merged_df){
  some_new_df <- freshly_merged_df
  for (i in unique(some_new_df[['SSID']])){
    for (j in 1:ncol(some_new_df)){
      if (all(is.na(some_new_df[which(some_new_df[['SSID']] == i), j]))){
        some_new_df[which(some_new_df[['SSID']] == i), j] <- "BMNDITS"
      }
    }
  }
}
But yeah, this is where I'm stuck and would greatly appreciate anybody's help or input. Many thanks!
Comments (2)
We may use a group-by approach: grouped by 'SSID', loop over all the columns (`everything()`) in `across`, then check `if` `all` the values are `NA`, then replace with `"BMNDITS"`, or `else` return the character-converted value (as the example showed, the columns are of `numeric` class).
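The code that accompanied this answer is not preserved here. The following is only a minimal sketch of the approach as described, not the answerer's original code; `demo_df` and `bmndits_df` are hypothetical names, and `demo_df` is a small stand-in for `toy_data_all`.

```r
library(dplyr)

# Hypothetical stand-in for toy_data_all: x2 is entirely NA within
# SSID 24002, while x1 only has a single scattered NA.
demo_df <- data.frame(
  x1   = c(0.5, NA, 1.2, -0.3),
  x2   = c(1.1, 0.7, NA, NA),
  SSID = c(24001, 24001, 24002, 24002)
)

bmndits_df <- demo_df %>%
  group_by(SSID) %>%
  mutate(across(
    everything(),  # grouping columns (SSID) are skipped by across()
    # Within each SSID group: label the column if every value is NA,
    # otherwise keep the values, converted to character so that all
    # groups return the same type.
    ~ if (all(is.na(.x))) rep("BMNDITS", length(.x)) else as.character(.x)
  )) %>%
  ungroup()
```

Because a column cannot hold both numbers and the string `"BMNDITS"`, every biomarker column comes back as character; scattered `NA`s stay `NA`, so they remain distinguishable from the labelled runs.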
Basically what @akrun did but only use base R:
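The code for this answer is also missing from this copy. As a rough sketch of one way the same idea could be written in base R (the function name `bmndits_base`, its arguments, and `demo_df` are assumptions made for illustration, not the answerer's code):

```r
# Hypothetical helper: label values whose whole SSID group is NA.
bmndits_base <- function(df, id_col = "SSID", label = "BMNDITS") {
  value_cols <- setdiff(names(df), id_col)
  for (col in value_cols) {
    # TRUE for every row whose SSID group is entirely NA in this column
    all_na_in_group <- ave(is.na(df[[col]]), df[[id_col]], FUN = all)
    values <- as.character(df[[col]])
    values[all_na_in_group] <- label
    df[[col]] <- values
  }
  df  # return the modified data frame explicitly
}

# Small stand-in for toy_data_all
demo_df <- data.frame(
  x1   = c(0.5, NA, 1.2, -0.3),
  x2   = c(1.1, 0.7, NA, NA),
  SSID = c(24001, 24001, 24002, 24002)
)
out <- bmndits_base(demo_df)
```

The final `df` line matters: a function returns the value of its last expression, and a `for` loop evaluates to `NULL`, which is why the attempt in the question returned `NULL` even though the loop body itself ran.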