Reading lines by number from a large file
I have a file with 15 million lines (it will not fit in memory). I also have a small vector of line numbers: the lines that I want to extract.
How can I read out those lines in one pass?
I was hoping for a C function that does it in one pass.
Comments (5)
The trick is to use a connection and to open it before calling read.table, so that each call continues from where the previous one stopped instead of starting again at the top of the file.
You may also try scan, which is faster and gives more control.
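The open-connection trick can be sketched as follows. This is a minimal illustration, not the answerer's original code: the helper name read_lines_by_number is hypothetical, and it assumes whitespace-separated fields and a file with no blank or comment lines (read.table skips those by default, which would throw off the line arithmetic).

```r
# Hypothetical helper (not from the original answer): read selected lines in one pass.
# Assumes whitespace-separated fields and no blank or comment lines.
read_lines_by_number <- function(path, wanted, ...) {
  wanted <- sort(unique(wanted))        # 1-based line numbers, ascending
  con <- file(path, open = "r")         # open once so the read position persists
  on.exit(close(con))
  pieces <- vector("list", length(wanted))
  pos <- 0                              # number of lines consumed so far
  for (i in seq_along(wanted)) {
    # skip the lines between the last one read and the next one wanted
    pieces[[i]] <- read.table(con, skip = wanted[i] - pos - 1, nrows = 1, ...)
    pos <- wanted[i]
  }
  do.call(rbind, pieces)                # one data frame, one row per wanted line
}
```

Because the connection stays open, the file is traversed only once, up to the largest requested line number. Extra arguments such as sep are passed through to read.table.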
If it's a binary file
Some discussion is here: Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If the lines are contiguous and at the top of the file, just use the nrows argument to read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to call read.csv repeatedly (reading a new row or group of contiguous rows with each call) and then rbind the results together.
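A rough sketch of the skip/nrows approach, under the assumption that the file has no header row and that line numbers refer to physical lines. The helper name read_csv_runs is made up for illustration. Note that when read.csv is given a file path, every call starts again from the top of the file, so this makes one pass per group of contiguous lines rather than a single pass overall.

```r
# Hypothetical helper: fetch groups of contiguous lines with repeated read.csv calls.
# Assumes a header-less CSV file; line numbers are 1-based.
read_csv_runs <- function(path, wanted) {
  wanted <- sort(unique(wanted))
  # start a new run wherever the next line number is not consecutive
  run_id <- cumsum(c(1, diff(wanted) != 1))
  runs <- split(wanted, run_id)
  pieces <- lapply(runs, function(r) {
    read.csv(path, header = FALSE, skip = r[1] - 1, nrows = length(r))
  })
  out <- do.call(rbind, unname(pieces))  # stack the runs into one data frame
  rownames(out) <- NULL
  out
}
```

Grouping consecutive line numbers into runs keeps the number of read.csv calls down when the requested lines cluster together.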
If your file has fixed line lengths, you can use seek to jump to any byte position. For each line number N you want (counting from 1), jump to (N - 1) * line_length, where line_length includes the line terminator, and read one line.
However, the R documentation for seek carries a warning about relying on it.
You can also use fseek from the standard C library in C, but I don't know whether that warning applies there too.
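For fixed-length lines, the offset arithmetic might look like the sketch below. It assumes every line occupies exactly line_length bytes including a single "\n" terminator; the helper name read_fixed_lines is hypothetical, and whether seek is reliable on a given platform is subject to the documentation warning mentioned above.

```r
# Hypothetical helper: jump straight to each wanted line of a fixed-width file.
# Assumes every line is exactly line_length bytes, including one "\n" terminator.
read_fixed_lines <- function(path, wanted, line_length) {
  con <- file(path, open = "rb")        # binary mode so offsets are plain byte counts
  on.exit(close(con))
  vapply(wanted, function(n) {
    # line n (1-based) starts at byte offset (n - 1) * line_length
    seek(con, where = (n - 1) * line_length, origin = "start")
    # read the line body without its trailing newline
    rawToChar(readBin(con, what = "raw", n = line_length - 1))
  }, character(1))
}
```

Unlike the sequential approaches, this never reads the unwanted lines at all, and the requested line numbers do not need to be sorted.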
Before I was able to get an R answer, I had already done it in Ruby.
It runs fast (as fast as my storage can read the file).
I put together a solution based on the discussions here.
It only reports the number of lines and reads nothing into memory. If you really want to skip blank lines, set the last argument to TRUE.
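The original code for this answer is not preserved here, so the following is only a stand-in showing one way to count lines without holding the file in memory. The function name count_lines and its chunk and skip_blank parameters are assumptions made for illustration.

```r
# Hypothetical stand-in: count lines in fixed-size chunks so memory use stays bounded.
count_lines <- function(path, chunk = 10000L, skip_blank = FALSE) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk)  # at most `chunk` lines held at a time
    if (length(lines) == 0) break       # end of file reached
    total <- total + if (skip_blank) {
      sum(nzchar(trimws(lines)))        # count only lines with non-whitespace content
    } else {
      length(lines)
    }
  }
  total
}
```

Knowing the line count up front can help validate the requested line numbers before extracting anything.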