Break one character string into multiple strings on separate lines

Posted 2024-12-09 12:16:56

I have a data frame that contains long character strings, each associated with a 'Sample':

Sample  Data
  1     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
  2     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N

I would like to code an easy way to break this string into 5 pieces in the following format:

Sample X
CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168

Giving an output that looks like this for each sample:

Sample 1
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N

I've been able to use the substr function to break the long string into individual pieces, but I'd like to be able to automate it so that I get all 5 pieces in one output. Ideally this output would also be a data frame.

梦罢 2024-12-16 12:16:56

This is what ?read.fwf is for.

First, some data that looks like the data in your question:

x <- data.frame(Sample = c(1, 2), Data = c("000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N", 
"000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"), 
stringsAsFactors = FALSE)

Now use read.fwf, specifying the width of each field and its name, and that every column should be read as character. The text column of the example data is wrapped in textConnection so that it can be treated as a connection, which read.fwf and the other read.* functions accept in place of a file.

(strs <- read.fwf(textConnection(x$Data), widths = c(33, 35, 31, 31, 38), colClasses = "character", col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")))


                               CCT6                                GAT1                            IMD3                            PDR3                                  RIM15
1 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
2 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
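Since the question asks for a data frame that still identifies each sample, one possible follow-up is to derive the widths from the start and end positions rather than typing them by hand, and then bind the Sample column back on. This is only a sketch and not part of the original answer; the names s, genes, starts, ends, widths, con and out are illustrative.

```r
# Example data as in the question (both samples share the same string)
s <- "000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"
x <- data.frame(Sample = c(1, 2), Data = c(s, s), stringsAsFactors = FALSE)

# Start and end positions taken from the question
genes  <- c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")
starts <- c(1, 34, 69, 100, 131)
ends   <- c(33, 68, 99, 130, 168)
widths <- ends - starts + 1  # field widths implied by the positions

# read.fwf does not close a connection that was already open, so close it here
con  <- textConnection(x$Data)
strs <- read.fwf(con, widths = widths, colClasses = "character",
                 col.names = genes)
close(con)

# One row per sample: the Sample id followed by the five pieces
out <- cbind(x["Sample"], strs)
```

Keeping the positions in vectors means a changed boundary only has to be edited in one place.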

Now loop over the rows and print out each one as per your example:

# seq_len() is safe even if strs has zero rows
for (i in seq_len(nrow(strs))) {
  writeLines(paste("Sample", x$Sample[i]))
  writeLines(paste(names(strs), unlist(strs[i, ]), sep = " - "))
}

Giving, for example:

Sample 2
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N


云仙小弟 2024-12-16 12:16:56

SampX <- textConnection("CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168")
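# sep = "-" splits each line into V1 (name), V2 ("Characters <start>") and V3 (end)
dfSampX <- read.table(SampX, sep = "-")
# V4 holds the numeric start position
dfSampX$V4 <- as.numeric(sub("Characters ", "", dfSampX$V2))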

sampdat <- read.table(textConnection("Sample  Data
  1     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
  2     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
"), header=TRUE,stringsAsFactors=FALSE)

This code breaks each string into segments (one row per sample, one column per segment):

apply(dfSampX[, c(3, 4)], 1, function(x) substr(sampdat[, 2], x["V4"], x["V3"]))
     [,1]                                [,2]                                 
[1,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
[2,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
     [,3]                              [,4]                             
[1,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
[2,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
     [,5]                                    
[1,] "0000000000000000000N000000N0000000000N"
[2,] "0000000000000000000N000000N0000000000N"
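The same split can also be written without parsing the position table, because substring() is vectorised over its first and last arguments. The following is a sketch under that assumption, not the answerer's code; s, genes, starts, ends, pieces and out are illustrative names, and the positions are copied from the question.

```r
# Example data as in the question (both samples share the same string)
s <- "000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"
sampdat <- data.frame(Sample = c(1, 2), Data = c(s, s),
                      stringsAsFactors = FALSE)

genes  <- c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")
starts <- c(1, 34, 69, 100, 131)
ends   <- c(33, 68, 99, 130, 168)

# For each sample string, substring() returns all five pieces at once;
# sapply() stacks them as columns, so transpose to get one row per sample
pieces <- t(sapply(sampdat$Data, substring, first = starts, last = ends,
                   USE.NAMES = FALSE))
colnames(pieces) <- genes

# Combine with the Sample ids into the data frame the question asks for
out <- data.frame(Sample = sampdat$Sample, pieces, stringsAsFactors = FALSE)
```

Here out has one row per sample and one column per segment, which matches the data frame output requested in the question.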

This code returns the fragments as a list, with one named character vector per sample:

res <- lapply(sampdat$Data, function(x) 
           apply(dfSampX[,c(3,4)], 1, function(y) substr(x, y["V4"], y["V3"]) ))

res2 <- lapply(res, function(x){ names(x) <- dfSampX$V1 ; return(x)} )
res2

[[1]]
                                   CCT6                                     GAT1  
     "000000000000000000000000000N01000"    "000000000N0N000000000N00N0000NN00N0" 
                                   IMD3                                     PDR3  
       "N000000100000N00N0N0000000NNNN0"        "1111111111111111111111111111111" 
                                  RIM15  
"0000000000000000000N000000N0000000000N" 

[[2]]
                                   CCT6                                     GAT1  
     "000000000000000000000000000N01000"    "000000000N0N000000000N00N0000NN00N0" 
                                   IMD3                                     PDR3  
       "N000000100000N00N0N0000000NNNN0"        "1111111111111111111111111111111" 
                                  RIM15  
"0000000000000000000N000000N0000000000N"

And to get the specified output format:

for (samp in seq_along(res2)) {
  cat("Sample ", samp, "\n")
  invisible(sapply(1:5, function(y)
    cat(as.character(dfSampX$V1[y]), " - ", res2[[samp]][y], "\n")))
}
Sample  1 
CCT6   -  000000000000000000000000000N01000 
GAT1   -  000000000N0N000000000N00N0000NN00N0 
IMD3   -  N000000100000N00N0N0000000NNNN0 
PDR3   -  1111111111111111111111111111111 
RIM15   -  0000000000000000000N000000N0000000000N 
Sample  2 
CCT6   -  000000000000000000000000000N01000 
GAT1   -  000000000N0N000000000N00N0000NN00N0 
IMD3   -  N000000100000N00N0N0000000NNNN0 
PDR3   -  1111111111111111111111111111111 
RIM15   -  0000000000000000000N000000N0000000000N

The call to invisible() suppresses the NULL values that sapply() collects from the cat() calls, so that they are not printed as a list.
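The extra spaces in the printed lines appear to come from two places: cat() inserts a space between its arguments, and read.table with sep = "-" keeps the whitespace around each name unless strip.white = TRUE is set. A possible way to get exactly the "NAME - value" layout from the question is sketched below; format_sample is a hypothetical helper, and the short strings are made-up toy values used only to show the formatting.

```r
# Hypothetical helper: build the lines to print for one sample
format_sample <- function(id, pieces) {
  c(paste("Sample", id),
    # trimws() drops stray whitespace around the names
    sprintf("%s - %s", trimws(names(pieces)), unname(pieces)))
}

# Toy input (not the real data), with trailing spaces in the names
pieces <- c("CCT6 " = "000N01000", "GAT1 " = "0N0N00000")

lines <- format_sample(1, pieces)
writeLines(lines)
```

With the list from the answer, the same helper could be applied to each element in turn, for example inside a loop over seq_along(res2).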
