将后缀附加到 csv 文件(或 SQLite 数据库)中的条目列
我有一个相对较大的 csv 文件(1.2gb...大到我一台计算机上的 2gb RAM)。对于一列中的每个条目,我想附加“1C”,以便我可以与另一个数据帧/数据库表连接/合并。
如果文件不是那么大,可以很容易地使用read.csv
导入到data
,然后使用data$symbol <-paste(data $symbol, "1C", sep="")
。但现在我收到 can't allocate vector of size x
警告。
像 scan()
这样的手动解决方案是我唯一的选择吗? (我有点担心损坏我的数据)谢谢!
I have a relatively large csv files (1.2gb... large to the 2gb RAM on one of my computers). To every entry in one column I would like to append "1C" so that I can join/merge with another dataframe/db table.
If the file weren't so large, it would be easy to use read.csv
to import to data
then use data$symbol <- paste(data$symbol, "1C", sep="")
. But now I get the can't allocate vector of size x
warning.
Is a manual solution, like scan()
, my only option? (I'm a bit afraid of corrupting my data) Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
这有点难看,但是您可以逐行读取该文件并将其逐行放入另一个文件中。这样你就永远不会把整个事情读进记忆里。这与 Joshua 的 awk 方法没有什么不同,但使用 R(而且速度要慢得多)。以下示例是我拼凑而成的示例,用于从 csv 中删除美元符号和百分号。您可以更改中间的逻辑来执行您想要的逻辑。
This is kinda ugly, but you could read the file, line by line and put it in another file, line by line. That way you would never read the whole thing into memory. This would be not unlike Joshua's awk approach, but using R (and considerably slower). The following example was one I had cobbled together to remove dollar signs and percent signs from a csv. You can alter the logic in the middle to do the logic you want.
如果您已经可以将数据导入 R,那么使用
scan
不会有帮助。请确保
data
仅包含您需要合并的列,然后运行 gc ()
,然后再尝试paste
命令(如果您接近内存限制,gc
将会有所帮助)。如果失败,请查看 此中的一些解决方案线程。
更新:
如果您碰巧使用 *nix 的风格,或者您在 Windows 上安装了 Rtools,则可以使用
gawk
来完成此操作。如果您的数据位于foo.csv
中,并且您想要将“C1”添加到第二列,这将创建一个新文件bar.csv
,其中包含“C1” ” 附加到第二列。这可能比 R 更快,并且消耗的内存量可以忽略不计。
Using
scan
isn't going to help if you can already get your data into R.Make sure
data
only has the columns you need to merge, and rungc()
before you try yourpaste
command (gc
will help if you're near the margin of your memory limit).If that fails, look at some of the solutions in this thread.
UPDATE:
And if you happen to be using a flavor of *nix, or if you have Rtools installed on windows, you could do this with
gawk
. If your data are infoo.csv
and you want to add the "C1" to the second column, this will create a new file,bar.csv
, with "C1" appended to the second column.This will likely be faster than R and will consume a negligible amount of memory.
只读取一列,我假设可能是稍后将操作合并到内存中的主要唯一标识符:
Read only the one column and what I am assuming might be a leading unique identifier for later merge operations into memory:
将 1C 附加到第二列:
perl -F, -lane '$F[1] .= "1C"; print join ",", @F' file.csv
使用这些命令行选项:
-n
循环输入文件的每一行-l
在处理之前删除换行符,并在之后将它们添加回-a
自动分割模式 - 将输入行分割到@F
数组中。默认为按空格分割。-e
执行perl代码-F
自动分割修饰符,在本例中分割,
@F
是数组每行中的单词数,以$F[0]
开头索引
.=
是“附加到”运算符Appends 1C to the second column:
perl -F, -lane '$F[1] .= "1C"; print join ",", @F' file.csv
These command-line options are used:
-n
loop around every line of the input file-l
removes newlines before processing, and adds them back in afterwards-a
autosplit mode – split input lines into the@F
array. Defaults to splitting on whitespace.-e
execute the perl code-F
autosplit modifier, in this case splits on,
@F
is the array of words in each line, indexed starting with$F[0]
.=
is the "append to" operator