Dynamically splitting a CSV in C#

Posted 2024-12-12 05:43:37


I have multiple 1.5 GB CSV Files which contain billing information on multiple accounts for clients from a service provider. I am trying to split the large CSV file into smaller chunks for processing and formatting the data inside it.

I do not want to roll my own CSV parser, but this is something I haven't seen yet, so please correct me if I am wrong. The 1.5 GB files contain information in the following order: account information, account number, bill date, transactions, Ex GST, Inc GST, type, and other lines.

Note that BillDate here means the date when the invoice was made, so occasionally we have more than two bill dates in the same CSV.

Bills are grouped by : Account Number > Bill Date > Transactions.

Some accounts have 10 lines of transaction details, some have over 300,000 lines. A large 1.5 GB CSV file contains around 8 million lines of data. I previously used UltraEdit to cut and paste it into smaller chunks, but this has become a very inefficient and time-consuming process.

I just want to load the large CSV file in my WinForm, click a button, and have it split the file into chunks of, say, no more than 250,000 lines. Some bills are actually larger than 250,000 lines; in that case, keep them in one piece and do not split an account across multiple files, since they are ordered anyway. Also, I do not want an account to have multiple bill dates within one CSV; in that case, the splitter can create an additional split.

I already have a WinForm application, built in VS C# 2010, that automatically formats the CSV in the smaller files.

Is it actually possible to process these very large CSV files? I have been trying to load the large files, but the OutOfMemoryException is an annoyance since it crashes every time, and I don't know how to fix it. I am open to suggestions.

Here is what I think I should be doing:

  • Load the large CSV file (this currently fails with an OutOfMemoryException; how do I solve that?).
  • Group the data by account name and bill date, and count the number of lines in each group.
  • Then create an array of integers.
  • Pass this array of integers to a file-splitter process, which will take the array and write out the blocks of data.
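The OutOfMemoryException in the first step goes away if the file is never loaded whole: stream it line by line with `StreamReader` and only roll over to a new part file when the 250,000-line limit is reached *and* the account/bill-date key changes, so no account is split. A minimal sketch under those assumptions (the column positions of the account number and bill date, and the comma `Split`, are hypothetical; quoted fields would need a real parser):

```csharp
using System;
using System.IO;

class CsvSplitter
{
    const int MaxLines = 250000;

    static void Split(string inputPath, string outputDir)
    {
        int part = 1, linesInPart = 0;
        string currentKey = null;
        StreamWriter writer = NewPart(outputDir, part);

        using (var reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null) // one line in memory at a time
            {
                string[] fields = line.Split(',');        // naive split; assumed layout
                string key = fields[1] + "|" + fields[2]; // account number + bill date (assumed columns)

                // Start a new part only when the limit is hit AND the key changes,
                // so one account/bill date is never spread across two files.
                if (linesInPart >= MaxLines && key != currentKey)
                {
                    writer.Dispose();
                    writer = NewPart(outputDir, ++part);
                    linesInPart = 0;
                }

                currentKey = key;
                writer.WriteLine(line);
                linesInPart++;
            }
        }
        writer.Dispose();
    }

    static StreamWriter NewPart(string dir, int n)
    {
        return new StreamWriter(Path.Combine(dir, "part" + n.ToString("D3") + ".csv"));
    }
}
```

With this shape, the grouping and line-counting pass from the list above is not strictly needed, since the boundary decision is made on the fly while streaming.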

Any suggestions will be greatly appreciated.

Thanks.


4 answers

开始看清了 2024-12-19 05:43:37


You can use CsvReader to stream through and parse the data, without needing to load it all into memory in one go.
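Assuming this refers to the LumenWorks "Fast CSV Reader" (`LumenWorks.Framework.IO.Csv.CsvReader`), a minimal streaming sketch looks like this; only the current record is ever in memory, and the column indices are assumptions:

```csharp
using System.IO;
using LumenWorks.Framework.IO.Csv; // LumenWorks Fast CSV Reader (assumed library)

class StreamingRead
{
    static void Process(string path)
    {
        // Second argument: the file has a header row.
        using (var csv = new CsvReader(new StreamReader(path), true))
        {
            while (csv.ReadNextRecord())          // advances one record at a time
            {
                string accountNumber = csv[1];    // column indices are assumptions
                string billDate = csv[2];
                // route this record to the appropriate output chunk here
            }
        }
    }
}
```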

戈亓 2024-12-19 05:43:37


Yea, about that... running out of memory is going to happen with files that are HUGE. You need to take your situation seriously.

As with most problems, break everything into steps.

I have had a similar type of situation before (large data file in CSV format, need to process, etc).

What I did:

Make step 1 of your program suite (or whatever) something that merely cuts your huge file into many smaller files. I have broken 5 GB zipped-up, PGP-encrypted files (after decryption... that's another headache) into many smaller pieces. You can do something simple like numbering them sequentially (i.e. 001, 002, 003...).

Then make an app to do the INPUT processing. No real business logic here. I hate FILE IO with a passion when it comes to business logic and I love the warm fuzzy feeling of data being in a nice SQL Server DB. That's just me. I created a thread pool and have N amount of threads (like 5, you decide how much your machine can handle) read those .csv part files you created.

Each thread reads one file. A one-to-one relationship. Because it is file I/O, make sure you don't have too many running at the same time. Each thread does the same basic operation: reads in the data, puts it into a basic structure for the db (table format), does lots of inserts, then ends the thread. I used LINQ to SQL because everything is strongly typed and whatnot, but to each their own. The better the db design, the easier it is for you to do the logic later.

After all threads have finished executing, you have all the data from the original CSV in the database. Now you can do all your business logic and do whatever from there. Not the prettiest solution, but I was forced into developing that given my situation/data flow/size/requirements. You might go with something completely different. Just sharing I guess.
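The fan-out this answer describes could be sketched with `Parallel.ForEach`, which ships with .NET 4 and so is available in VS 2010; `MaxDegreeOfParallelism` caps how many part files are read concurrently (the directory path and degree of 5 are placeholders):

```csharp
using System.IO;
using System.Threading.Tasks;

class PartFileLoader
{
    static void LoadAll(string partsDir)
    {
        // Cap concurrent workers so file I/O doesn't thrash (5 is arbitrary here).
        var options = new ParallelOptions { MaxDegreeOfParallelism = 5 };

        Parallel.ForEach(Directory.GetFiles(partsDir, "*.csv"), options, partFile =>
        {
            // File.ReadLines streams the part file rather than loading it whole.
            foreach (string line in File.ReadLines(partFile))
            {
                // Parse the line into your table structure here, then batch the
                // inserts (e.g. SqlBulkCopy or LINQ to SQL, as the answer does).
            }
        });
    }
}
```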

爺獨霸怡葒院 2024-12-19 05:43:37


You can use an external sort. I suppose you'd have to do an initial pass through the file to identify proper line boundaries, as CSV records are probably not of a fixed length.

Hopefully, there might be some ready-made external sort implementations for .NET that you could use.
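The boundary pass mentioned above can be as simple as tracking quote parity: a newline only ends a CSV record when it falls outside a quoted field, and an even count of double quotes seen so far means we are outside one (an escaped quote `""` adds two, preserving parity). A sketch:

```csharp
class CsvBoundary
{
    // Returns true if the buffered text ends on a record boundary, i.e. the
    // trailing newline is not inside an open quoted field.
    public static bool IsRecordComplete(string buffered)
    {
        int quotes = 0;
        foreach (char c in buffered)
            if (c == '"') quotes++;
        return quotes % 2 == 0; // even quote count => not inside quotes
    }
}
```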

困倦 2024-12-19 05:43:37


There's a very useful class in the Microsoft.VisualBasic.FileIO namespace that I've used for dealing with CSV files - the TextFieldParser Class.

It might not help with the large file size, but it's built-in and handles quoted and non-quoted fields (even if mixed in the same line). I've used it a couple of times in projects at work.

Despite the assembly name, it can be used with C#, in case you're wondering.
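Minimal usage looks like this (add a reference to `Microsoft.VisualBasic.dll`; the file name is a placeholder):

```csharp
using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll

class ParseWithTextFieldParser
{
    static void Read(string path)
    {
        using (var parser = new TextFieldParser(path))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // handles quoted and unquoted fields

            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields(); // one record at a time
                // process one record's fields here
            }
        }
    }
}
```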
