CSV Random Access; C#
I have a 10GB CSV file which is essentially a huge square matrix. I am trying to write a function that can access a single cell of the matrix as efficiently as possible, i.e. matrix[12345,20000].
Given its size, it is obviously not possible to load the entire matrix into a 2D array; I need to somehow read the values directly from the file.
I have Googled around looking at random file access using FileStream.Seek; unfortunately, because of variable rounding, each cell isn't a fixed width, so it would not be possible for me to seek to a specific byte and know which cell I'm looking at by some sort of arithmetic.
I considered scanning the file and creating a lookup table for the index of the first byte of each row. That way, if I wanted to access matrix[12345,20000] I would seek to the start of row 12345 and then scan across the line, counting the commas until I reach the correct cell.
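Roughly what I'm picturing is below (just a sketch: it assumes '\n' line endings, that every cell parses as a double, and the file path is a placeholder):

```csharp
// Sketch of the row-offset lookup table idea: one pass to record where each
// row starts, then seek + scan a single line per lookup.
using System;
using System.Collections.Generic;
using System.IO;

class CsvCellReader
{
    private readonly string path;
    private readonly List<long> rowOffsets = new List<long>();

    public CsvCellReader(string path)
    {
        this.path = path;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            rowOffsets.Add(0);                 // row 0 starts at byte 0
            long pos = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)  // FileStream buffers internally
            {
                pos++;
                if (b == '\n')
                    rowOffsets.Add(pos);       // next row starts right after the newline
            }
        }
    }

    public double GetCell(int row, int col)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(rowOffsets[row], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
            {
                // Scan across the single line, splitting on commas to reach the cell.
                string line = reader.ReadLine();
                return double.Parse(line.Split(',')[col]);
            }
        }
    }
}
```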
I am about to try this, but has anyone else got any better ideas? I'm sure I wouldn't be the first person to try and deal with a file like this.
Cheers
Edit: I should note that the file contains a very sparse matrix. If parsing the CSV file ends up being too slow, I would consider converting the file to a more appropriate, and easier to process, file format. What is the best way to store a sparse matrix?
Comments (6)
I have used the Lumenworks CSV reader for quite large CSV files; it may be worth a quick look to see how quickly it can parse your file.
Lumenworks CSV
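I don't have the exact code to hand, but reading one cell with it looks roughly like this (API from memory, so double-check the constructor and namespace against the library docs; the file path is a placeholder):

```csharp
// Rough usage of the LumenWorks CsvReader: stream forward to the wanted row,
// then pull the field by column index.
using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

class Program
{
    static void Main()
    {
        int targetRow = 12345, targetCol = 20000;

        // second argument: false = the file has no header row
        using (var csv = new CsvReader(new StreamReader("matrix.csv"), false))
        {
            int row = 0;
            while (csv.ReadNextRecord())
            {
                if (row == targetRow)
                {
                    Console.WriteLine(csv[targetCol]); // field access by column index
                    break;
                }
                row++;
            }
        }
    }
}
```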
First of all, how would you want to refer to a particular row? Is it by the row's index, so that you have another table or something that helps you know which row you are interested in? Or is it by an ID or something?
These ideas come to mind
An index file would be the best you could do, I bet. With rows of unknown size, there is no way to skip directly to a line other than to either scan the file or keep an index.
The only question is how large your index gets. If it is too large, you could make it smaller by indexing only every 5th line (for example) and scanning within a range of 5 lines.
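A quick sketch of that, reusing the lookup-table idea from the question but keeping only every 5th offset (N = 5 and the '\n' line-ending assumption are arbitrary choices):

```csharp
// Sparse row index: store the byte offset of every Nth row, seek to the
// nearest indexed row, then skip at most N-1 lines forward.
using System;
using System.Collections.Generic;
using System.IO;

class SparseRowIndex
{
    private const int N = 5;                        // index every Nth row
    private readonly List<long> offsets = new List<long>();
    private readonly string path;

    public SparseRowIndex(string path)
    {
        this.path = path;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            offsets.Add(0);                         // offset of row 0
            long pos = 0;
            int row = 0, b;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n' && ++row % N == 0)
                    offsets.Add(pos);               // offsets of rows N, 2N, ...
            }
        }
    }

    public string GetCell(int row, int col)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offsets[row / N], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
            {
                for (int skip = row % N; skip > 0; skip--)
                    reader.ReadLine();              // scan over the in-between rows
                return reader.ReadLine().Split(',')[col];
            }
        }
    }
}
```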
Pre-process the file so that the fields are fixed width. Then you can do your random reads easily.
From doing similar sorts of things in the past, you should be able to write some simple code that reads the 10G variable-width file from a local disk and writes a 10G fixed-width file back to local disk in a few (~20) minutes. Whether that up-front time investment pays off depends on how many random reads you need to do and how often the file to be read changes.
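A minimal sketch of the conversion and the arithmetic read, assuming a fixed cell width of 24 characters, ASCII content, and '\n' row terminators (all of those are choices, not requirements):

```csharp
// Convert the variable-width CSV to a fixed-width file once, then locate any
// cell by pure arithmetic: row * rowBytes + col * CellWidth.
using System;
using System.Globalization;
using System.IO;
using System.Text;

static class FixedWidth
{
    const int CellWidth = 24;   // wide enough for a double printed in full

    public static void Convert(string csvPath, string fixedPath)
    {
        using (var reader = new StreamReader(csvPath))
        using (var writer = new StreamWriter(fixedPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                foreach (var field in line.Split(','))
                    writer.Write(field.PadLeft(CellWidth));  // pad every cell
                writer.Write('\n');                          // one-byte row terminator
            }
        }
    }

    public static double ReadCell(string fixedPath, int row, int col, int columns)
    {
        long rowBytes = (long)columns * CellWidth + 1;       // +1 for the '\n'
        using (var fs = new FileStream(fixedPath, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(row * rowBytes + (long)col * CellWidth, SeekOrigin.Begin);
            var buffer = new byte[CellWidth];
            fs.Read(buffer, 0, CellWidth);
            return double.Parse(Encoding.ASCII.GetString(buffer).Trim(),
                                CultureInfo.InvariantCulture);
        }
    }
}
```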
What if you created 12345 separate files that are read with lazy instantiation? Each file would only be read if its data was needed. If the data is completely sparse, you could create a data structure with an IsEmpty bool property.
Do you need to access the same element over and over or do you need to just read each element once?
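A rough sketch of that per-row lazy loading, with made-up file names like row12345.csv (the directory layout and naming are assumptions):

```csharp
// One file per row; each row is parsed only the first time it is touched.
using System;
using System.IO;
using System.Linq;

class LazyRowMatrix
{
    private readonly Lazy<double[]>[] rows;

    public LazyRowMatrix(string directory, int rowCount)
    {
        rows = new Lazy<double[]>[rowCount];
        for (int i = 0; i < rowCount; i++)
        {
            int row = i;   // capture the loop variable for the closure
            rows[i] = new Lazy<double[]>(() =>
                File.ReadAllText(Path.Combine(directory, "row" + row + ".csv"))
                    .Trim()
                    .Split(',')
                    .Select(double.Parse)
                    .ToArray());
        }
    }

    public double this[int row, int col]
    {
        get { return rows[row].Value[col]; }   // triggers the file read on first access
    }
}
```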
I disagree that you shouldn't load the file into RAM, especially if you use a 64-bit OS.
It shouldn't be a problem to allocate a matrix of size 12345x20000: that's only about 1.9 GB in double precision. In fact, even if the size were bigger, I would still recommend this approach on a 64-bit platform (see "virtual memory").
Secondly, you stated that your matrix is sparse, so you could load it into RAM but use a sparse representation to save some memory.
In conclusion, if your application requires many accesses to the matrix and performance is somewhat important, putting it in RAM would definitely be my favourite approach.
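If it helps, here is a minimal sketch of a sparse in-memory representation (the column count parameter and the assumption that missing cells mean 0.0 are mine):

```csharp
// Sparse matrix held in RAM: only non-zero cells are stored, keyed by
// row * columns + col; anything not present reads back as 0.0.
using System;
using System.Collections.Generic;
using System.IO;

class SparseMatrix
{
    private readonly Dictionary<long, double> cells = new Dictionary<long, double>();
    private readonly long columns;

    public SparseMatrix(string csvPath, int columns)
    {
        this.columns = columns;
        using (var reader = new StreamReader(csvPath))
        {
            string line;
            for (int row = 0; (line = reader.ReadLine()) != null; row++)
            {
                var fields = line.Split(',');
                for (int col = 0; col < fields.Length; col++)
                {
                    double value = double.Parse(fields[col]);
                    if (value != 0.0)                    // keep only non-zero cells
                        cells[row * columns + col] = value;
                }
            }
        }
    }

    public double this[int row, int col]
    {
        get
        {
            double v;
            return cells.TryGetValue(row * columns + col, out v) ? v : 0.0;
        }
    }
}
```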