Fast/low-memory way to parse the first two columns of a large CSV file in C#

Posted 2024-11-14 16:11:34

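I'm parsing a large CSV file, about 500 MB (many rows, many columns). I only need the first two columns (so everything up to the second comma on each line). Also, multiple threads need to access this file at the same time, so I can't take an exclusive lock.

What is the fastest and least memory-hungry approach to this problem? Which classes/methods should I be looking at? I assume I should stay as low-level as possible: reading character by character, line by line?

Perhaps this is a way to allow simultaneous access?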

using ( var filestream = new FileStream( filePath , FileMode.Open , FileAccess.Read , FileShare.Read ) )
{
     using ( var reader = new StreamReader( filestream ) )
     {
       ...
     }
}

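Edit
Decided to check out http://www.codeproject.com/KB/database/CsvReader.aspx which seems to let me read just two columns and then skip to the next line. They also have some benchmarks showing fast performance and a low memory profile.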



半边脸i 2024-11-21 16:11:34

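If you want low memory usage, you could read the file line by line with StreamReader and ReadLine.

The other day, in a similar situation, I was able to skip the first 20,000,000 lines of a 500 MB file and build a string (using StringBuilder) from the next 1,000,000 lines in about 7 seconds.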
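
As a rough illustration of the line-by-line approach, here is a minimal sketch rather than a tuned implementation. The class name FirstTwoColumns and its Read method are hypothetical names chosen for this example. It assumes the file has no quoted fields containing commas or line breaks, and it opens the file with FileShare.Read, as in the question, so that other readers can open it at the same time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of any library.
static class FirstTwoColumns
{
    // Lazily yields the first two comma-separated fields of each line.
    // Assumption: no quoted fields containing commas or line breaks.
    public static IEnumerable<(string First, string Second)> Read(string filePath)
    {
        // FileShare.Read allows other readers to open the file concurrently.
        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int firstComma = line.IndexOf(',');
                if (firstComma < 0) continue; // fewer than two columns: skip the line

                // If there is no second comma, the second column runs to the end of the line.
                int secondComma = line.IndexOf(',', firstComma + 1);
                int secondEnd = secondComma < 0 ? line.Length : secondComma;

                yield return (line.Substring(0, firstComma),
                              line.Substring(firstComma + 1, secondEnd - firstComma - 1));
            }
        }
    }
}
```

Because the method is an iterator, only one line is held in memory at a time. ReadLine still allocates a string for every line, including the columns that are then thrown away.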


兮颜 2024-11-21 16:11:34

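Assuming the file contains ASCII-encoded text (typical for CSV), your best bet is probably to use Stream directly with the Stream.Read method, which lets you read into a pre-allocated buffer. This has a couple of advantages:

You only allocate the buffer once, whereas ReadLine() creates a new string for every line.
You don't have to perform Unicode conversion on the entire line; you can do it only for the part before the second comma, or (if you are severely time-constrained) you could write your own number parser that operates directly on the ASCII data in the buffer (I'm sure there are well-documented algorithms for this). This assumes, of course, that you need numeric data.

Other methods you will probably need include the ASCII encoding methods, particularly Encoding.ASCII.GetString.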
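
The buffer-based idea above could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not a benchmarked implementation. ByteScanner, ReadFirstTwoColumns, and the onRow callback are hypothetical names, the 64 KB buffer size is an arbitrary choice, and the code assumes ASCII text with no quoted fields. Only the bytes before the second comma are copied and decoded with Encoding.ASCII.GetString; the rest of each line is skipped without conversion.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
static class ByteScanner
{
    // Calls onRow(first, second) for every line that has at least two columns.
    // Assumptions: ASCII text, no quoted fields containing commas or line breaks.
    public static void ReadFirstTwoColumns(string filePath, Action<string, string> onRow)
    {
        var buffer = new byte[64 * 1024]; // read buffer, allocated once (size is arbitrary)
        var prefix = new byte[256];       // bytes of columns 1 and 2 of the current line
        int prefixLen = 0, firstComma = -1, commas = 0;

        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    byte b = buffer[i];
                    if (b == '\n')
                    {
                        if (commas >= 1) Emit(prefix, prefixLen, firstComma, onRow);
                        prefixLen = 0; firstComma = -1; commas = 0; // reset for the next line
                    }
                    else if (commas < 2) // past the second comma, bytes are ignored
                    {
                        if (b == ',')
                        {
                            if (++commas == 1) firstComma = prefixLen; // column 2 starts here
                        }
                        else if (b != '\r')
                        {
                            if (prefixLen == prefix.Length) Array.Resize(ref prefix, prefix.Length * 2);
                            prefix[prefixLen++] = b;
                        }
                    }
                }
            }
        }

        // Handle a final line that has no trailing newline.
        if (commas >= 1) Emit(prefix, prefixLen, firstComma, onRow);
    }

    private static void Emit(byte[] prefix, int prefixLen, int firstComma, Action<string, string> onRow)
    {
        // Only the first two columns are ever converted to strings.
        onRow(Encoding.ASCII.GetString(prefix, 0, firstComma),
              Encoding.ASCII.GetString(prefix, firstComma, prefixLen - firstComma));
    }
}
```

The line state is kept across Read calls, so a line that straddles two buffer fills is still handled. If the columns are numeric, the two GetString calls could be replaced by a parser that works on the bytes in prefix directly, as the answer suggests.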