Is it possible to efficiently fetch a subset of rows from a large fixed-width CSV file?
I have an extremely large fixed-width CSV file (1.3 million rows and 80K columns). It's about 230 GB in size. I need to be able to fetch a subset of those rows. I have a vector of row indices that I need. However, I need to now figure out how to traverse such a massive file to get them.
The way I understand it, C++ will go through the file line by line until it hits the newline (or a given delimiter), at which point it'll clear the buffer and then move on to the next line. I have also heard of a seek() function that can go to a given position in a stream. So is it possible to use this function somehow to get the pointer to the correct line number quickly?
I figured that since the program doesn't have to basically run billions of if statements to check for newlines, it might improve the speed if I simply tell the program where to go in the fixed-width file. But I have no idea how to do that.
Let's say that my file has a width of n characters and my line numbers are {l_1, l_2, l_3, ..., l_m} (where l_1 < l_2 < ... < l_m). In that case, I can simply tell the file pointer to go to (l_1 - 1) * n, right? But then for the next line, do I calculate the next jump from the end of line l_1 or from the beginning of the next line? And should I include the newlines when calculating the jumps?
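To make it concrete, here is a rough sketch of what I have in mind (the 24-character width matches the sample lines below; whether I should add 1 for the newline is exactly what I'm unsure about):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

int main() {
    const std::streamoff n = 24;          // fixed line width in characters, excluding the newline
    const std::streamoff stride = n + 1;  // + 1 for '\n' -- is this the right way to count?

    std::ifstream file("data.csv", std::ios::binary);

    std::vector<std::uint64_t> lines = {1, 3, 4};  // the 1-based line numbers I need (l_1 < l_2 < ...)
    std::string record(n, '\0');

    for (std::uint64_t l : lines) {
        file.seekg(static_cast<std::streamoff>(l - 1) * stride);  // absolute offset from the start
        file.read(&record[0], n);                                 // read one fixed-width line
        // ... parse `record` ...
    }
}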
Will this even help improve speed, or am I just misunderstanding something here?
Thanks for taking the time to help
EDIT: The file will look like this:
id0000001,AB,AB,AA,--,BB
id0000002,AA,--,AB,--,BB
id0000003,AA,AA,--,--,BB
id0000004,AB,AB,AA,AB,BB
2 Answers
As I proposed in the comment, you can compress each data field to two bits:
That cuts your file size by a factor of 12, so it'll be ~20 GB. Considering that your processing is likely I/O-bound, you may speed up processing by the same factor of 12.
The resulting file will have a record length of 20,000 bytes, so it will be easy to calculate an offset to any given record. No newline symbols to consider :)
Here is how I build that binary file:
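A sketch of the idea (the particular 2-bit assignment AA=0, AB=1, BB=2, "--"=3, the file names, and the skipping of the id column are placeholders, not necessarily the exact scheme):

#include <fstream>
#include <string>
#include <cstdint>

// Map one two-character field to a 2-bit code. The assignment is arbitrary;
// it only has to be applied consistently when encoding and decoding.
static std::uint8_t encode(const std::string& field) {
    if (field == "AA") return 0;
    if (field == "AB") return 1;
    if (field == "BB") return 2;
    return 3;  // "--" (missing)
}

int main() {
    std::ifstream src("src.csv");                    // the original fixed-width CSV
    std::ofstream dst("dst.bin", std::ios::binary);  // packed output: 4 fields per byte

    std::string line;
    while (std::getline(src, line)) {
        std::uint8_t packed = 0;
        int count = 0;
        // Skip the id column, then walk the ",XX" fields (each 3 characters wide).
        for (std::string::size_type pos = line.find(','); pos != std::string::npos;
             pos = line.find(',', pos + 1)) {
            packed = static_cast<std::uint8_t>((packed << 2) | encode(line.substr(pos + 1, 2)));
            if (++count == 4) {                      // 80,000 fields / 4 = 20,000 bytes per record
                dst.put(static_cast<char>(packed));
                packed = 0;
                count = 0;
            }
        }
        if (count != 0)                              // flush a partial final byte, left-aligned
            dst.put(static_cast<char>(packed << (2 * (4 - count))));
    }
}

Note that the record index in the packed file implicitly equals the line number, so the id column does not need to be stored.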
This can encode about 1,600 records in a second, so the whole file will take ~15 minutes. How long does it take you now to process it?
UPDATE: Added an example of how to read individual records from src. I only managed to make seekg() work in binary mode.
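Something along these lines (a sketch; it assumes the packed file built above, a 20,000-byte record length for 80,000 two-bit fields, and zero-based record indices):

#include <fstream>
#include <vector>
#include <cstdint>

int main() {
    const std::streamoff record_len = 20000;         // 80,000 fields * 2 bits = 20,000 bytes
    std::ifstream src("dst.bin", std::ios::binary);  // binary mode so seekg() works on raw byte offsets

    std::vector<std::uint64_t> wanted = {0, 42, 1299999};  // 0-based record indices to fetch
    std::vector<char> record(record_len);

    for (std::uint64_t r : wanted) {
        src.seekg(static_cast<std::streamoff>(r) * record_len);  // jump straight to the record
        src.read(record.data(), record_len);
        // Unpack field i with:
        //   (static_cast<unsigned char>(record[i / 4]) >> (6 - 2 * (i % 4))) & 0x3
    }
}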
The seek family of functions in the <iostream> classes is generally byte-oriented. You can use them if, and only if, you are absolutely confident that your records (lines, in this case) have a fixed count of bytes; in that case, instead of getline, you can open the file in binary mode and use .read, which reads the specified number of bytes into a byte array of sufficient capacity. But, because the file is storing text after all, if even a single record has a different size, you'll fall out of alignment; if the id field is guaranteed to equal the line number, or at least to be an increasing mapping of it, an educated guess followed by trial and error can help. You should switch to some better database management soon; even a 10 GB single binary file is too large and prone to corruption. You may consider chopping it into much smaller slices (on the order of 100 MB, maybe) so as to minimize the chance of damage propagation. Plus, you will need some redundancy mechanism for recovery/correction.
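For illustration, a minimal sketch of that binary-mode seek-and-read approach, including the kind of id-based alignment check suggested above (the 24-byte line width matches the sample lines in the question, and a lone '\n' terminator is assumed):

#include <cstdio>
#include <fstream>
#include <string>

int main() {
    const std::streamoff width = 24;          // bytes per line in the question's sample, excluding '\n'
    const std::streamoff stride = width + 1;  // assuming a single '\n' terminator, no "\r\n"

    std::ifstream file("data.csv", std::ios::binary);
    std::string record(width, '\0');

    long long wanted = 3;                     // 1-based line to fetch (arbitrary example)
    file.seekg((wanted - 1) * stride);
    file.read(&record[0], width);

    // Sanity check: the id field ("id0000003") should encode the line number.
    // If it does not, some earlier record had a different size and we are out of alignment.
    long long id = std::stoll(record.substr(2, 7));
    if (id != wanted)
        std::fprintf(stderr, "misaligned: expected line %lld, got id %lld\n", wanted, id);
}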