Fast/low-memory way to parse the first two columns of a large CSV file in C#

Posted on 2024-11-14 16:11:34

I'm parsing a large CSV file, about 500 MB (many rows, many columns). I only need the first two columns (so up to the second comma on each line). Also, multiple threads need access to this file at the same time, so I can't take an exclusive lock.

What's the fastest/least memory consuming approach to this problem? What classes/methods should I be looking at? I assume that I should stay as low-level as possible - reading character by character, line by line?

Perhaps this is a way to allow simultaneous access?

using ( var filestream = new FileStream( filePath , FileMode.Open , FileAccess.Read , FileShare.Read ) )
{
     using ( var reader = new StreamReader( filestream ) )
     {
       ...
     }
}
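
As a rough illustration of the shared-read idea above, the sketch below has every reader open its own read-only handle with FileShare.Read, so several threads can scan the same file at once. The class and method names (SharedReadDemo, CountLines, CountInParallel) are hypothetical and exist only for this example. Note that FileShare.Read only permits other readers; if another process holds the file open for writing, FileShare.ReadWrite would likely be needed instead.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class SharedReadDemo
{
    // Hypothetical helper: counts lines through its own read-only handle.
    public static long CountLines(string filePath)
    {
        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var reader = new StreamReader(stream))
        {
            long count = 0;
            while (reader.ReadLine() != null) count++;
            return count;
        }
    }

    // Starts several readers at once; each one opens the file independently.
    public static long[] CountInParallel(string filePath, int readers)
    {
        var tasks = new Task<long>[readers];
        for (int i = 0; i < readers; i++)
            tasks[i] = Task.Run(() => CountLines(filePath));
        Task.WaitAll(tasks);
        return Array.ConvertAll(tasks, t => t.Result);
    }
}
```

Because no handle is shared between threads, there is no stream position to synchronize; each reader pays for its own buffer instead.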

Edit
I decided to check out http://www.codeproject.com/KB/database/CsvReader.aspx, which seems to let me read just two columns and then skip to the next line.
They also have some benchmarks showing fast performance and a low memory profile.

Comments (2)

半边脸i 2024-11-21 16:11:34

If you want low memory use, you'll probably want a StreamReader, reading line by line with ReadLine.

In a similar case the other day, I was able to skip the first 20,000,000 lines in a 500 MB file and build a string (using StringBuilder) for the next 1,000,000 lines in about 7 seconds.
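
A minimal sketch of the line-by-line approach, applied to the question's goal of keeping only the first two columns. The class and method names (FirstTwoColumns, Read) are made up for this example, and it assumes plain comma-separated data with no quoted fields or embedded commas. It looks for the first two commas with IndexOf rather than splitting the whole line, so only two short substrings are created per row (on top of the line string that ReadLine allocates).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FirstTwoColumns
{
    // Hypothetical helper: lazily yields the first two fields of each line.
    // Assumption: no quoted fields, so every comma is a field separator.
    public static IEnumerable<(string First, string Second)> Read(string filePath)
    {
        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int firstComma = line.IndexOf(',');
                if (firstComma < 0) continue; // fewer than two columns: skipped in this sketch

                int secondComma = line.IndexOf(',', firstComma + 1);
                if (secondComma < 0) secondComma = line.Length; // line has exactly two columns

                yield return (line.Substring(0, firstComma),
                              line.Substring(firstComma + 1, secondComma - firstComma - 1));
            }
        }
    }
}
```

Since the method is an iterator, rows are produced one at a time as the caller enumerates, and the file is never held in memory as a whole.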

兮颜 2024-11-21 16:11:34

Assuming that the file contains ASCII-encoded text (which is typical for CSV), your best bet may be to use Stream directly with the Stream.Read method, which lets you read into a pre-allocated buffer. This has a few advantages:

  1. You only allocate a buffer once, whereas ReadLine() will create a new String for every line.

  2. You don't have to perform the Unicode conversion for the entire line; you can either do this only for the portion up to the second comma or (if you're severely time-constrained) write your own numeric parser that operates directly on the ASCII data in the buffer (I'm sure there are well-documented algorithms for doing this). This is assuming you need numeric data, of course.

Additional methods you'll likely need include the ASCII Encoding methods, particularly Encoding.ASCII.GetString.
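
One way the buffer-based approach might look is sketched below. The names (BufferedPrefixReader, ReadPrefixes) and the buffer sizes are assumptions made for illustration, and the code assumes ASCII data, '\n' or "\r\n" line endings, and no quoted fields. It reads raw bytes into a buffer that is allocated once, copies only the bytes before the second comma into a small scratch array, and calls Encoding.ASCII.GetString on just that portion at the end of each line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BufferedPrefixReader
{
    // Hypothetical helper: for each non-empty line, yields the text before the second comma.
    public static IEnumerable<string> ReadPrefixes(string filePath, int bufferSize = 64 * 1024)
    {
        var buffer = new byte[bufferSize]; // allocated once, reused for every Read call
        var prefix = new byte[256];        // bytes before the second comma; grows if needed
        int prefixLength = 0;
        int commas = 0;

        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < bytesRead; i++)
                {
                    byte b = buffer[i];
                    if (b == (byte)'\n')
                    {
                        // End of line: decode only the part that was kept.
                        if (prefixLength > 0)
                            yield return Encoding.ASCII.GetString(prefix, 0, prefixLength);
                        prefixLength = 0;
                        commas = 0;
                    }
                    else if (commas < 2 && b != (byte)'\r')
                    {
                        // The second comma ends the prefix; later bytes on this line are ignored.
                        if (b == (byte)',' && ++commas == 2) continue;
                        if (prefixLength == prefix.Length)
                            Array.Resize(ref prefix, prefix.Length * 2);
                        prefix[prefixLength++] = b;
                    }
                }
            }
        }

        // Last line when the file does not end with a newline.
        if (prefixLength > 0)
            yield return Encoding.ASCII.GetString(prefix, 0, prefixLength);
    }
}
```

Keeping the line state (prefixLength, commas) outside the read loop means a line that straddles two Read calls is handled the same way as any other. A caller that needs numbers could parse the scratch bytes directly instead of decoding them, as the answer suggests.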
