将后缀附加到 csv 文件（或 SQLite 数据库）中的条目列

发布于 2024-09-30 16:09:22 字数 353 浏览 3 评论 0原文

我有一个相对较大的 csv 文件（1.2gb...大到我一台计算机上的 2gb RAM）。对于一列中的每个条目，我想附加“1C”，以便我可以与另一个数据帧/数据库表连接/合并。

如果文件不是那么大，可以很容易地使用read.csv导入到data，然后使用data$symbol <-paste(data $symbol, "1C", sep="")。但现在我收到 can't allocate vector of size x 警告。

像 scan() 这样的手动解决方案是我唯一的选择吗？（我有点担心损坏我的数据）谢谢！

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

耳钉梦 2024-10-07 16:09:22

这有点难看，但是您可以逐行读取该文件并将其逐行放入另一个文件中。这样你就永远不会把整个事情读进记忆里。这与 Joshua 的 awk 方法没有什么不同，但使用 R（而且速度要慢得多）。以下示例是我拼凑而成的示例，用于从 csv 中删除美元符号和百分号。您可以更改中间的逻辑来执行您想要的逻辑。

con  <- file(inputFile, open = "r")
con2 <- file(temporaryFile, open = "w")

while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
    oneLine <- gsub("%", "", oneLine)     #strip out the percent signs
    oneLine <- gsub("\\$", "", oneLine)   #strip out the dollar signs
    cat(oneLine, file = con2, sep = "\n") #spit the line back out
  } 

close(con)
close(con2)

This is kinda ugly, but you could read the file, line by line and put it in another file, line by line. That way you would never read the whole thing into memory. This would be not unlike Joshua's awk approach, but using R (and considerably slower). The following example was one I had cobbled together to remove dollar signs and percent signs from a csv. You can alter the logic in the middle to do the logic you want.

con  <- file(inputFile, open = "r")
con2 <- file(temporaryFile, open = "w")

while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
    oneLine <- gsub("%", "", oneLine)     #strip out the percent signs
    oneLine <- gsub("\\$", "", oneLine)   #strip out the dollar signs
    cat(oneLine, file = con2, sep = "\n") #spit the line back out
  } 

close(con)
close(con2)

回复收藏 0 原文

流心雨 2024-10-07 16:09:22

如果您已经可以将数据导入 R，那么使用 scan 不会有帮助。

请确保 data 仅包含您需要合并的列，然后运行 gc ()，然后再尝试 paste 命令（如果您接近内存限制，gc 将会有所帮助）。

如果失败，请查看此中的一些解决方案线程。

更新：
如果您碰巧使用 *nix 的风格，或者您在 Windows 上安装了 Rtools，则可以使用 gawk 来完成此操作。如果您的数据位于 foo.csv 中，并且您想要将“C1”添加到第二列，这将创建一个新文件 bar.csv，其中包含“C1” ” 附加到第二列。

compy: /home/josh
> cat foo.csv 
1,one,2,two
3,three,4,four
5,five,6,six

compy: /home/josh
> gawk -F "," '{OFS=","; $2=($2 "1C"); print}' < foo.csv > bar.csv

compy: /home/josh
> cat bar.csv 
1,one1C,2,two
3,three1C,4,four
5,five1C,6,six

这可能比 R 更快，并且消耗的内存量可以忽略不计。

Using scan isn't going to help if you can already get your data into R.

Make sure data only has the columns you need to merge, and run gc() before you try your paste command (gc will help if you're near the margin of your memory limit).

If that fails, look at some of the solutions in this thread.

UPDATE:
And if you happen to be using a flavor of *nix, or if you have Rtools installed on windows, you could do this with gawk. If your data are in foo.csv and you want to add the "C1" to the second column, this will create a new file, bar.csv, with "C1" appended to the second column.

compy: /home/josh
> cat foo.csv 
1,one,2,two
3,three,4,four
5,five,6,six

compy: /home/josh
> gawk -F "," '{OFS=","; $2=($2 "1C"); print}' < foo.csv > bar.csv

compy: /home/josh
> cat bar.csv 
1,one1C,2,two
3,three1C,4,four
5,five1C,6,six

This will likely be faster than R and will consume a negligible amount of memory.

回复收藏 0 原文

各自安好 2024-10-07 16:09:22

只读取一列，我假设可能是稍后将操作合并到内存中的主要唯一标识符：

twocol <- read.table(file="what.csv", colClasses=("numeric", "character", rep("NULL", <number-of-remaining-columns>) , sep="," )
twocol$V2 <- paste(twocol$V2, "1C", sep="")
merge(twocol, otherdf, by.x="V1" , ....)

Read only the one column and what I am assuming might be a leading unique identifier for later merge operations into memory:

twocol <- read.table(file="what.csv", colClasses=("numeric", "character", rep("NULL", <number-of-remaining-columns>) , sep="," )
twocol$V2 <- paste(twocol$V2, "1C", sep="")
merge(twocol, otherdf, by.x="V1" , ....)

回复收藏 0 原文