将后缀附加到 csv 文件(或 SQLite 数据库)中的条目列

发布于 2024-09-30 16:09:22 字数 353 浏览 1 评论 0原文

我有一个相对较大的 csv 文件(1.2gb...大到我一台计算机上的 2gb RAM)。对于一列中的每个条目,我想附加“1C”,以便我可以与另一个数据帧/数据库表连接/合并。

如果文件不是那么大,可以很容易地使用read.csv导入到data,然后使用data$symbol <-paste(data $symbol, "1C", sep="")。但现在我收到 can't allocate vector of size x 警告。

scan() 这样的手动解决方案是我唯一的选择吗? (我有点担心损坏我的数据)谢谢!

I have a relatively large csv files (1.2gb... large to the 2gb RAM on one of my computers). To every entry in one column I would like to append "1C" so that I can join/merge with another dataframe/db table.

If the file weren't so large, it would be easy to use read.csv to import to data then use data$symbol <- paste(data$symbol, "1C", sep=""). But now I get the can't allocate vector of size x warning.

Is a manual solution, like scan(), my only option? (I'm a bit afraid of corrupting my data) Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

耳钉梦 2024-10-07 16:09:22

这有点难看,但是您可以逐行读取该文件并将其逐行放入另一个文件中。这样你就永远不会把整个事情读进记忆里。这与 Joshua 的 awk 方法没有什么不同,但使用 R(而且速度要慢得多)。以下示例是我拼凑而成的示例,用于从 csv 中删除美元符号和百分号。您可以更改中间的逻辑来执行您想要的逻辑。

con  <- file(inputFile, open = "r")
con2 <- file(temporaryFile, open = "w")

while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
    oneLine <- gsub("%", "", oneLine)     #strip out the percent signs
    oneLine <- gsub("\\$", "", oneLine)   #strip out the dollar signs
    cat(oneLine, file = con2, sep = "\n") #spit the line back out
  } 

close(con)
close(con2)

This is kinda ugly, but you could read the file, line by line and put it in another file, line by line. That way you would never read the whole thing into memory. This would be not unlike Joshua's awk approach, but using R (and considerably slower). The following example was one I had cobbled together to remove dollar signs and percent signs from a csv. You can alter the logic in the middle to do the logic you want.

con  <- file(inputFile, open = "r")
con2 <- file(temporaryFile, open = "w")

while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
    oneLine <- gsub("%", "", oneLine)     #strip out the percent signs
    oneLine <- gsub("\\$", "", oneLine)   #strip out the dollar signs
    cat(oneLine, file = con2, sep = "\n") #spit the line back out
  } 

close(con)
close(con2)
流心雨 2024-10-07 16:09:22

如果您已经可以将数据导入 R,那么使用 scan 不会有帮助。

请确保 data 仅包含您需要合并的列,然后运行 ​​gc (),然后再尝试 paste 命令(如果您接近内存限制,gc 将会有所帮助)。

如果失败,请查看 此中的一些解决方案线程


更新:
如果您碰巧使用 *nix 的风格,或者您在 Windows 上安装了 Rtools,则可以使用 gawk 来完成此操作。如果您的数据位于 foo.csv 中,并且您想要将“C1”添加到第二列,这将创建一个新文件 bar.csv,其中包含“C1” ” 附加到第二列。

compy: /home/josh
> cat foo.csv 
1,one,2,two
3,three,4,four
5,five,6,six

compy: /home/josh
> gawk -F "," '{OFS=","; $2=($2 "1C"); print}' < foo.csv > bar.csv

compy: /home/josh
> cat bar.csv 
1,one1C,2,two
3,three1C,4,four
5,five1C,6,six

这可能比 R 更快,并且消耗的内存量可以忽略不计。

Using scan isn't going to help if you can already get your data into R.

Make sure data only has the columns you need to merge, and run gc() before you try your paste command (gc will help if you're near the margin of your memory limit).

If that fails, look at some of the solutions in this thread.


UPDATE:
And if you happen to be using a flavor of *nix, or if you have Rtools installed on windows, you could do this with gawk. If your data are in foo.csv and you want to add the "C1" to the second column, this will create a new file, bar.csv, with "C1" appended to the second column.

compy: /home/josh
> cat foo.csv 
1,one,2,two
3,three,4,four
5,five,6,six

compy: /home/josh
> gawk -F "," '{OFS=","; $2=($2 "1C"); print}' < foo.csv > bar.csv

compy: /home/josh
> cat bar.csv 
1,one1C,2,two
3,three1C,4,four
5,five1C,6,six

This will likely be faster than R and will consume a negligible amount of memory.

各自安好 2024-10-07 16:09:22

只读取一列,我假设可能是稍后将操作合并到内存中的主要唯一标识符:

twocol <- read.table(file="what.csv", colClasses=("numeric", "character", rep("NULL", <number-of-remaining-columns>) , sep="," )
twocol$V2 <- paste(twocol$V2, "1C", sep="")
merge(twocol, otherdf, by.x="V1" , ....)

Read only the one column and what I am assuming might be a leading unique identifier for later merge operations into memory:

twocol <- read.table(file="what.csv", colClasses=("numeric", "character", rep("NULL", <number-of-remaining-columns>) , sep="," )
twocol$V2 <- paste(twocol$V2, "1C", sep="")
merge(twocol, otherdf, by.x="V1" , ....)
太阳公公是暖光 2024-10-07 16:09:22

将 1C 附加到第二列:

perl -F, -lane '$F[1] .= "1C"; print join ",", @F' file.csv

使用这些命令行选项:

  • -n 循环输入文件的每一行
  • -l在处理之前删除换行符,并在之后将它们添加回
  • -a 自动分割模式 - 将输入行分割到 @F 数组中。默认为按空格分割。
  • -e 执行perl代码
  • -F 自动分割修饰符,在本例中分割

@F 是数组每行中的单词数,以 $F[0]
开头索引
.= 是“附加到”运算符

Appends 1C to the second column:

perl -F, -lane '$F[1] .= "1C"; print join ",", @F' file.csv

These command-line options are used:

  • -n loop around every line of the input file
  • -l removes newlines before processing, and adds them back in afterwards
  • -a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
  • -e execute the perl code
  • -F autosplit modifier, in this case splits on ,

@F is the array of words in each line, indexed starting with $F[0]
.= is the "append to" operator

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文