grep or pmatch?

Posted 2024-12-10 09:12:20


I am trying to import a series of files from a directory and convert each of them into a data frame. I would also like to use the file name to create two new columns with name-dependent values. Input files have the format xx_yy.out, where XX can currently be one of three values and YY currently has two possible values. In the future these numbers will go up.

Edit: solution based on the comments (see below for the original question).

Edited again to reflect the suggestions of @JoshO'Brien.

filelist <- dir(pattern = "\\.out$")            # only names ending in ".out"

for (i in filelist) {

    tempdata  <- read.table(i)                       # read the table
    filelistshort <- sub("\\.out$", "", i)           # strip the ".out" extension
    tempsplit <- strsplit(filelistshort, "_")        # split the name at the underscore
    xx <- sapply(tempsplit, "[", 1)                  # get xx
    yy <- sapply(tempsplit, "[", 2)                  # get yy
    tempdata$XX <- xx                                # add XX column
    tempdata$YY <- yy                                # add YY column
    assign(filelistshort, tempdata)                  # name the data frame after the file

}
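As an alternative to calling assign() inside the loop, the same steps could be wrapped in a function and the results collected in a named list. This is only a sketch: the helper name read_tagged and the two demo files written to a temporary directory are assumptions made so that the example runs on its own.

```r
# Hypothetical helper: read one "xx_yy.out" file and tag it with XX and YY.
read_tagged <- function(path) {
    dat   <- read.table(path)
    root  <- sub("\\.out$", "", basename(path))      # strip the extension
    parts <- strsplit(root, "_", fixed = TRUE)[[1]]  # split at the underscore
    dat$XX <- parts[1]
    dat$YY <- parts[2]
    dat
}

# Made-up demo data so the sketch is self-contained.
demo_dir <- file.path(tempdir(), "out_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(a = 1:2), file.path(demo_dir, "aa_mm.out"))
write.table(data.frame(a = 3:4), file.path(demo_dir, "bb_nn.out"))

files  <- dir(demo_dir, pattern = "\\.out$", full.names = TRUE)
tables <- lapply(files, read_tagged)
names(tables) <- sub("\\.out$", "", basename(files))

tables$aa_mm
```

Keeping the data frames in one list means later steps can loop over tables or combine them with do.call(rbind, tables), rather than looking up variables by name.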

Below is the original code, showing that I wanted some way to get the XX and YY values but wasn't sure of the best approach.

My outline (after @romanlustrik's post) is as follows:

filelist <- dir(pattern = "\\.out$")
lapply(filelist, FUN = function(x) {
    xx <- NA    # placeholder: lookup via grep() or pmatch()
    yy <- NA    # placeholder: lookup via grep() or pmatch()
    dat <- read.table(x)    # read.table() already returns a data frame
    dat$colx <- xx
    dat$coly <- yy
    dat
})

where the xx <- and yy <- lines would be a lookup based on either pmatch or grep. I am playing around to make either one work but would welcome any suggestions.
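For illustration, a lookup of that kind might look like the sketch below. The vectors known_xx and known_yy are hypothetical stand-ins for the allowed values, and the approach assumes that no allowed value is a prefix of another.

```r
# Hypothetical lookup tables: the values XX and YY are allowed to take.
known_xx <- c("aa", "bb", "cc")
known_yy <- c("mm", "nn")

fname <- "bb_nn.out"
root  <- sub("\\.out$", "", fname)   # "bb_nn"

# pmatch(x, table) does prefix matching, so each known XX value is
# tested as a prefix of the file's root name.
xx <- known_xx[!is.na(pmatch(known_xx, root))]

# grepl() checks each known YY value as the part after the underscore.
is_yy <- vapply(known_yy,
                function(y) grepl(paste0("_", y, "$"), root),
                logical(1))
yy <- known_yy[is_yy]

xx
yy
```

Both lookups need the lists of allowed values to be kept up to date as new XX and YY values appear, which is one reason to prefer splitting the name directly.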


花想c 2024-12-17 09:12:20


If we can assume that your file names will contain only a single "_", I wouldn't use grep() or pmatch() at all.

strsplit() seems to provide a cleaner and simpler solution:

filelist <- c("aa_mm.out", "bb_mm.out", "cc_nn.out")

# Remove the trailing ".out"
rootNames <- sub("\\.out$", "", filelist)

# Split string at the "_"
rootParts <- strsplit(rootNames, "_")

# Extract the first and second parts into character vectors
xx <- sapply(rootParts, "[", 1)
yy <- sapply(rootParts, "[", 2)

xx
# [1] "aa" "bb" "cc"
yy
# [1] "mm" "mm" "nn"
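Since this relies on each name containing exactly one "_", it may be worth checking that assumption before using the parts. A minimal sketch, in which the data frame name info is an arbitrary choice:

```r
filelist  <- c("aa_mm.out", "bb_mm.out", "cc_nn.out")
rootParts <- strsplit(sub("\\.out$", "", filelist), "_", fixed = TRUE)

# Stop early if any name does not split into exactly two parts.
stopifnot(lengths(rootParts) == 2)

# One row per file: the file name plus its two parts.
parts <- do.call(rbind, rootParts)
info  <- data.frame(file = filelist,
                    XX   = parts[, 1],
                    YY   = parts[, 2],
                    stringsAsFactors = FALSE)
info
```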


番薯 2024-12-17 09:12:20


This is an ugly hack, but gets the job done.

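fl <- c("12_34.out", "ab_23.out", "02_rk.out")
xx <- regexpr(pattern = ".._", text = fl)    # two characters followed by "_"
# stop at match.length - 2 so the trailing "_" is not included
(XX <- substr(fl, start = xx, stop = xx + attr(xx, "match.length") - 2))
# [1] "12" "ab" "02"
yy <- regexpr(pattern = "_..", text = fl)    # "_" followed by two characters
# start at yy + 1 to skip the leading "_"
(YY <- substr(fl, start = yy + 1, stop = yy + attr(yy, "match.length") - 1))
# [1] "34" "23" "rk"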
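The patterns above assume that both parts are exactly two characters long. If that may not hold, one alternative is a single pattern with two capture groups. This is a sketch only: the fourth file name is made up to show a longer part, and the variable name pat is arbitrary.

```r
fl <- c("12_34.out", "ab_23.out", "02_rk.out", "long_name.out")

# Two capture groups: everything before the "_", and everything
# between the "_" and the ".out" extension.
pat <- "^([^_]+)_([^_]+)\\.out$"
XX <- sub(pat, "\\1", fl)
YY <- sub(pat, "\\2", fl)

XX
YY
```

Note that sub() returns a name unchanged when the pattern does not match, so names that do not follow the xx_yy.out format would need to be filtered out first, for example with grepl(pat, fl).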
