Runtime-critical reading of CSV files in C
Is there a way to code a swift, efficient reader for CSV files? (The point to note here: I am talking about a CSV file with a million+ lines.)
Run time is the critical metric here.
One resource on the internet concentrated on using binary file operations to read in bulk, but I am not sure whether that will be helpful in reading CSV files.
There are other methods as well, like the SourceForge code written by Robert Gamble. Is there a way to write it using native functions?
Edit: let's split the whole question up more clearly:
Is there an efficient (run-time-critical) way to read files in C? (In this case, a .csv file a million rows long.)
Is there a swift, efficient way to parse a CSV file?
Comments (4)
There is no single way of reading and parsing any type of file that is fastest all the time. However, you might want to build a Ragel grammar for CSVs; those tend to be pretty fast. You can adapt it to your specific type of CSV (comma-separated, semicolon-separated, numbers only, etc.) and perhaps skip over any data that you're not going to use. I've had good experience with dataset-specific SQL parsers that could skip over much of their input (database dumps).
Reading in bulk might be a good idea, but you should measure on actual data whether it's really faster than stdio buffering. Using binary I/O might speed things up a bit on Windows, but then you need to handle newlines somewhere else.
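As a rough illustration of the bulk-reading idea (not the answerer's code), here is a minimal sketch that reads the file in large chunks with fread() and scans each chunk with memchr() instead of reading line by line; the file name data.csv and the 1 MiB chunk size are assumptions for the example, and its timing should be compared against a plain fgets() loop on your own data:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1 << 20)                /* 1 MiB per read; tune and measure */

int main(void)
{
    FILE *f = fopen("data.csv", "rb"); /* hypothetical input file */
    if (!f) { perror("fopen"); return 1; }

    char *buf = malloc(CHUNK);
    if (!buf) { fclose(f); return 1; }

    size_t n, lines = 0;
    while ((n = fread(buf, 1, CHUNK, f)) > 0) {
        /* scan the whole chunk for newlines instead of reading line by line */
        const char *p = buf, *end = buf + n;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;                   /* a real parser would split fields here */
            p++;
        }
    }
    printf("lines: %zu\n", lines);

    free(buf);
    fclose(f);
    return 0;
}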
In my experience, parsing CSV files, even in a higher-level interpreted language, isn't usually the bottleneck. Huge amounts of data take a lot of space; CSV files are big, and most of the loading time is I/O, that is, the hard drive reading tons of digits into memory.
So my strong advice is to consider compressing the CSVs. gzip does its job very efficiently: it manages to squash and restore CSV streams on the fly, speeding up saving and loading by greatly decreasing file size and thus I/O time.
If you are developing under Unix, you can try this at the cost of no additional code at all, by piping CSV input and output through gzip -c and gunzip -c. Just try it; for me it sped things up tens of times.
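On a POSIX system the same pipe can be opened from inside the C program with popen(), so the reading code stays an ordinary stdio loop. A minimal sketch, assuming gunzip is on the PATH and the compressed file is named data.csv.gz (both assumptions for illustration):

#include <stdio.h>

int main(void)
{
    /* gunzip -c writes the decompressed CSV to stdout; popen() lets us
       read that stream like an ordinary FILE*. */
    FILE *p = popen("gunzip -c data.csv.gz", "r");
    if (!p) { perror("popen"); return 1; }

    char line[4096];
    long rows = 0;
    while (fgets(line, sizeof line, p) != NULL)
        rows++;                        /* parse the fields of `line` here */

    printf("rows: %ld\n", rows);
    return pclose(p) == -1 ? 1 : 0;
}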
Set the input buffer to a much larger size than the default using setvbuf. This is the only thing you can do in C to increase the read speed. Also do some timing tests, because there will be a point of diminishing returns beyond which there is no point in increasing the buffer size.
Outside of C, you can start by putting that .CSV onto an SSD drive, or storing it on a compressed filesystem.
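A minimal sketch of that suggestion, assuming a hypothetical data.csv and an arbitrary 4 MiB buffer; the right size is exactly what the timing tests should determine:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("data.csv", "r");  /* hypothetical input file */
    if (!f) { perror("fopen"); return 1; }

    /* setvbuf() must be called before the first read on the stream */
    size_t bufsize = 4u << 20;         /* 4 MiB is an arbitrary starting point */
    char *iobuf = malloc(bufsize);
    if (!iobuf || setvbuf(f, iobuf, _IOFBF, bufsize) != 0)
        fprintf(stderr, "falling back to the default buffer\n");

    char line[4096];
    long rows = 0;
    while (fgets(line, sizeof line, f) != NULL)
        rows++;

    printf("rows: %ld\n", rows);
    fclose(f);                         /* close the stream before freeing its buffer */
    free(iobuf);
    return 0;
}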
The best you can hope for is to haul large blocks of text into memory (or "memory map" a file), and process the text in memory.
The thorn in the efficiency is that text lines are variable-length records. Generally, text is read until an end-of-line terminator is found, which means reading a character and checking for eol. Many platforms and libraries try to make this more efficient by reading blocks of data and searching the data for eol.
Your CSV format further complicates the issue. In a CSV file, the fields are variable length as well, so you are again searching for a terminator character such as a comma, tab, or vertical bar.
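A minimal sketch of the memory-mapping approach mentioned above, for POSIX systems only and with data.csv as a placeholder name; it scans the mapping in place for newlines, and a real parser would split the fields on commas inside the same loop:

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.csv", O_RDONLY);   /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* walk the mapping once; field splitting on ',' would go in this loop */
    size_t lines = 0;
    const char *p = data, *end = data + st.st_size;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        lines++;
        p = nl ? nl + 1 : end;
    }
    printf("lines: %zu\n", lines);

    munmap(data, (size_t)st.st_size);
    close(fd);
    return 0;
}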
If you want better performance, you will have to change the data layout to fixed field lengths and fixed record lengths. Pad fields if necessary; the application can remove the extra padding. Fixed-length records are very efficient as far as reading is concerned: just read N bytes, with no scanning, straight into a buffer somewhere.
Fixed-length fields allow for random access into the record (or text line). The offset of each field is constant and can be calculated easily. No searching is required.
In summary, variable-length records and fields are, by their nature, not the most efficient data structures; time is wasted searching for terminator characters. Fixed-length records and fixed-length fields are more efficient since they don't require searching.
If your application is data intensive, perhaps restructuring the data will make the program more efficient.
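To illustrate the fixed-length layout, here is a small sketch under the assumption that every record has been padded to exactly 64 bytes in a hypothetical file fixed.dat; row i then starts at byte i * 64 and can be fetched with one seek and one read, with no scanning at all:

#include <stdio.h>

#define RECLEN 64  /* assumed: every record padded to exactly 64 bytes */

/* Read record number `row` into `out`, which must hold RECLEN + 1 bytes. */
static int read_record(FILE *f, long row, char *out)
{
    if (fseek(f, row * (long)RECLEN, SEEK_SET) != 0)
        return -1;
    if (fread(out, 1, RECLEN, f) != RECLEN)
        return -1;
    out[RECLEN] = '\0';
    return 0;
}

int main(void)
{
    FILE *f = fopen("fixed.dat", "rb");    /* hypothetical fixed-layout file */
    if (!f) { perror("fopen"); return 1; }

    char rec[RECLEN + 1];
    if (read_record(f, 123456, rec) == 0)  /* jump straight to row 123456 */
        printf("row 123456: %s\n", rec);

    fclose(f);
    return 0;
}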