Splitting strings with different fixed-width formats

Posted on 2025-02-02 00:57:35

Newbie question! I have a column containing strings in two different fixed-width formats. The format type can be recognized by its name (the first three characters of the string), and the string should be split according to that format.

df <- data.frame(
var1 =  c('M1B123456789MM1158','M1C123456789zMM1183'),
var2 =  c('code1','code8'))

The fixed-width formats are:

formatM1B = c(3,9,2,4)
formatM1C = c(3,9,1,2,4)

So I am hoping for this result:

 |format|var1_2   |var1_3|var1_5|var1_6|code |
1|M1B   |123456789|      |MM    |1158  |code1|
2|M1C   |123456789|z     |MM    |1183  |code8|

I tried the functions separate, str_split and str_split_fixed, but I don't know how to combine them with some kind of IF function to "test" or "regex" the format named in the string.
This question has certainly been asked many times; I spent hours researching without finding anything I could adapt to my data :/

Comments (3)

み零 2025-02-09 00:57:35

If we define the widths with a zero where "z" is missing, then we can use the dedicated read.fwf function:

w <- list(M1B = c(3,9,0,2,4), M1C = c(3,9,1,2,4))

do.call(rbind, 
        lapply(df$var1, function(i){
               read.fwf(textConnection(i), widths = w[ substr(i, 1, 3) ])
  }))

#    V1        V2   V3 V4   V5
# 1 M1B 123456789 <NA> MM 1158
# 2 M1C 123456789    z MM 1183

醉梦枕江山 2025-02-09 00:57:35

library(tidyverse)

df %>% 
   extract(col= var1,into = c('format','1','2','3','4'), 
          regex = "^(M[1-9][A-Z])([1-9]{9})(z)?(M{2})([1-9]{4})")

The regular expression has 5 groups:

  1. (M[1-9][A-Z]): matches an M, a digit from 1 to 9, and an uppercase letter
  2. ([1-9]{9}): matches 9 digits, each from 1 to 9 (0 is not matched)
  3. (z)?: matches a z if present, otherwise skips it
  4. (M{2}): matches 2 consecutive M characters
  5. ([1-9]{4}): matches 4 digits, each from 1 to 9

Output:

  format         1 2  3    4  var2
1    M1B 123456789   MM 1158 code1
2    M1C 123456789 z MM 1183 code8
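
To see what the five groups capture without loading tidyr, here is a minimal sketch using base R's regexec and regmatches. Two things in it are my own assumptions, not part of the answer above: the digit classes are widened to [0-9] in case real data contains zeros, and a trailing $ is added so that strings with extra characters fail to match. It also assumes every string matches the pattern.

```r
x <- c("M1B123456789MM1158", "M1C123456789zMM1183")

# Assumption: [0-9] instead of [1-9], and an end anchor, unlike the pattern above
pattern <- "^(M[1-9][A-Z])([0-9]{9})(z)?(M{2})([0-9]{4})$"

# regexec() locates the full match and each capture group;
# regmatches() extracts the corresponding substrings
m <- regmatches(x, regexec(pattern, x))

# Drop the first element (the full match) and stack the groups as rows
parts <- do.call(rbind, lapply(m, function(g) g[-1]))
colnames(parts) <- c("format", "1", "2", "3", "4")
print(parts)
```

The third column holds the optional z group, which is where the two formats differ.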

三人与歌 2025-02-09 00:57:35

Here is a function that does the splitting based on your formatM1B/C vectors:

f1 <- function(string, vec){
  start <- c(1, cumsum(vec)[-length(vec)] + 1)
  end <- cumsum(vec)
  apply(data.frame(start, end), 1, function(i)substring(string, i[1], i[2]))
}

And we can apply it as follows:

Map(function(x, y) f1(x, y), df$var1,list(formatM1B, formatM1C))

#$M1B123456789MM1158
#[1] "M1B"       "123456789" "MM"        "1158"     

#$M1C123456789zMM1183
#[1] "M1C"       "123456789" "z"         "MM"        "1183"     
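
The Map call above pairs strings and formats by position, so it relies on the rows being in the same order as the list of formats. Below is a rough sketch of one way to pick the format from the string's three-character prefix instead and collect everything in a data frame shaped like the one asked for. The names formats and split_fixed are made up for illustration, and the zero width for the missing "z" field in M1B is borrowed from the read.fwf answer so that both formats yield five fields.

```r
df <- data.frame(
  var1 = c("M1B123456789MM1158", "M1C123456789zMM1183"),
  var2 = c("code1", "code8"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: widths keyed by prefix, 0 = field absent in M1B
formats <- list(M1B = c(3, 9, 0, 2, 4), M1C = c(3, 9, 1, 2, 4))

# Hypothetical helper: cut one string into fields of the given widths
split_fixed <- function(string, widths) {
  end <- cumsum(widths)
  start <- end - widths + 1
  # substring() returns "" when start > end, which covers zero-width fields
  substring(string, start, end)
}

# Choose the width vector from the first three characters of each string
pieces <- lapply(df$var1, function(s) split_fixed(s, formats[[substr(s, 1, 3)]]))

result <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(result) <- c("format", "var1_2", "var1_3", "var1_5", "var1_6")
result$code <- df$var2
print(result)
```

A string whose prefix is not in formats would make the lookup fail, so real data would need a check for unknown prefixes first.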