C#: Reading fixed-width records with varying record types in one file

Posted on 2024-09-07 18:03:05

To start, I'd like to make clear that I'm not very well versed in C#. I'm working on a project in C# on .NET 3.5 that has me building a class to read and export files containing multiple fixed-width formats, one per record type.

There are currently 5 record types. The character in the first position of each line indicates the type, and therefore the line format. The problem I have is that the formats are all different from each other.

Record type 1 has 5 columns, signifies beginning of the file

Record type 3 has 10 columns, signifies beginning of a batch
Record type 5 has 69 columns, signifies a transaction
Record type 7 has 12 columns, signifies end of the batch, summarizes
(these 3 repeat throughout the file to contain each batch)

Record type 9 has 8 columns, signifies end of the file, summarizes

Is there a good library out there for these kinds of fixed-width files? I've seen a few good ones, but they want to load the entire file using a single spec, and that won't work here.

Roughly 250 of these files are read at the end of every month, and their combined size averages about 300 MB. Efficiency is very important to me in this project.

Based on my knowledge of the data, I've built a class hierarchy of what I "think" the objects should look like...

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Extract_Processing
{
    class Extract
    {
        private string mFilePath;
        private string mFileName;
        private FileHeader mFileHeader;
        private FileTrailer mFileTrailer;
        private List<Batch> mBatches;       // A file can have many batches

        public Extract(string filePath)
        { /* Using file path some static method from another class would be called to parse in the file somehow */ }

        public string ToString()
        { /* Iterates all objects down the hierarchy to return the file in string format */ }

        public void ToFile()
        { /* Calls some method in the file parse static class to export the file back to storage somewhere */ }
    }

    class FileHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Batch
    {
        private string mBatchNumber;                // Should this be pulled out of the batch header to make LINQ querying simpler for this data set?
        private BatchHeader mBatchHeader;
        private BatchTrailer mBatchTrailer;
        private List<Transaction> mTransactions;    // A batch can have multiple transactions

        public string ToString()
        { /* Iterates through batches to return what the entire batch would look like in string format */ }
    }

    class BatchHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Transaction
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class BatchTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class FileTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }

}

I've left out many constructors and other methods, but I think the idea should be pretty solid. I'm looking for ideas and critique of the approach I'm considering since, again, I'm not knowledgeable about C# and execution time is the highest priority.

The biggest question, besides some critique, is: how should I bring in this file? I've brought in many files in other languages, such as VBA using FSO methods, Microsoft Access with an ImportSpec to read in the file (5 times, one for each spec... wow, that was inefficient!), and a 'Cursor' object in Visual FoxPro (which was FAAAAAAAST but, again, had to be done five times), but I'm looking for hidden gems in C#, if such things exist.

Thanks for reading my novel; let me know if you're having trouble understanding it. I'm taking the weekend to go over this design to see if I buy it and want to put in the effort to implement it this way.

Comments (3)

往昔成烟 2024-09-14 18:03:05

FileHelpers is nice. It has a couple of drawbacks in that it doesn't seem to be under active development anymore, and it makes you use public variables for your fields instead of letting you use properties. But otherwise good.

What are you doing with these files? Are you loading them into SQL Server? If so, and you're looking for FAST and SIMPLE, I'd recommend a design like this:

  1. Make staging tables in your database that correspond to each of the 5 record types. Consider adding a LineNumber column and a FileName column too just so you can trace problems back to the file itself.
  2. Read the file line by line and parse it out into your business objects, or directly into ADO.NET DataTable objects that correspond to your tables.
  3. If you used business objects, apply your data transformations or business rules and then put the data into DataTable objects that correspond to your tables.
  4. Once each DataTable reaches an appropriate BatchSize (say 1000 records), use the SqlBulkCopy object to pump the data into your staging tables. After each SqlBulkCopy operation, clear out the DataTable and continue processing.
  5. If you didn't want to use business objects, do any final data manipulation in SQL Server.

You could probably accomplish the whole thing in under 500 lines of C#.
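
As a rough illustration of steps 2 to 4, here is a minimal sketch of a DataTable that fills up and is flushed every BatchSize rows. The class name `StagingBuffer`, the three staging columns, and the flush delegate are assumptions made for this example, not part of the question. The actual SqlBulkCopy call is shown only in a comment because it needs a real connection string and destination table.

```csharp
using System;
using System.Data;

// Hypothetical helper: buffers parsed rows and hands them off in batches.
public class StagingBuffer
{
    private readonly DataTable table;
    private readonly int batchSize;
    private readonly Action<DataTable> flush;

    public StagingBuffer(string tableName, int batchSize, Action<DataTable> flush)
    {
        this.batchSize = batchSize;
        this.flush = flush;
        this.table = new DataTable(tableName);
        // Assumed staging columns; a real table would have one column per field.
        table.Columns.Add("FileName", typeof(string));
        table.Columns.Add("LineNumber", typeof(int));
        table.Columns.Add("RawLine", typeof(string));
    }

    // Adds one row and flushes automatically once the batch size is reached.
    public void Add(string fileName, int lineNumber, string rawLine)
    {
        table.Rows.Add(fileName, lineNumber, rawLine);
        if (table.Rows.Count >= batchSize) Flush();
    }

    // Sends any buffered rows to the flush delegate, then clears the table.
    // Call this once more at end of file to push the final partial batch.
    public void Flush()
    {
        if (table.Rows.Count == 0) return;
        flush(table);
        table.Clear();
    }
}

// A flush delegate backed by SqlBulkCopy (System.Data.SqlClient) could look
// like this; connectionString is a placeholder you would supply:
//
//   delegate(DataTable t)
//   {
//       using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
//       {
//           bulk.DestinationTableName = t.TableName;
//           bulk.WriteToServer(t);
//       }
//   }
```

With one such buffer per record type, the parsing loop only has to call `Add` on the right buffer and call `Flush` on each of them when the file ends.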

五里雾 2024-09-14 18:03:05

Biggest question besides some critique is, how should I bring in this file?

I do not know of any good library for file IO, but the reading is pretty straightforward.

Instantiate a StreamReader with a 64 kB buffer to limit disk I/O operations (my estimate is an average of 1,500 transactions per file at the end of the month).

Now you can stream over the file:
1) Use Read at the beginning of each line to determine the type of the record.
2) Use ReadLine, then Substring at the known column offsets, to get the column values (String.Split only helps when the columns are delimited, which fixed-width records are not).
3) Create the object using the column values.
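
The steps above can be sketched roughly as follows. This is a minimal illustration, not a full parser: the `ParsedBatch` and `ParsedExtract` classes are simplified stand-ins that keep each record as its raw line, the ASCII encoding is an assumption, and the field offsets passed to `Field` are hypothetical since the real layouts are not given. It reads the whole line with ReadLine and looks at `line[0]` instead of calling Read first, and it slices with Substring because the columns are fixed-width.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Simplified, hypothetical model: every record is kept as its raw line.
public class ParsedBatch
{
    public string Header;                                   // type 3 line
    public List<string> Transactions = new List<string>();  // type 5 lines
    public string Trailer;                                  // type 7 line
}

public class ParsedExtract
{
    public string FileHeader;                               // type 1 line
    public List<ParsedBatch> Batches = new List<ParsedBatch>();
    public string FileTrailer;                              // type 9 line
}

public static class ExtractReader
{
    // Opens the file with a 64 kB buffer (encoding is an assumption).
    public static ParsedExtract ReadFile(string path)
    {
        using (StreamReader reader = new StreamReader(path, Encoding.ASCII, false, 64 * 1024))
        {
            return Read(reader);
        }
    }

    // Single pass over the file: the first character selects the record type.
    public static ParsedExtract Read(TextReader reader)
    {
        ParsedExtract extract = new ParsedExtract();
        ParsedBatch current = null;
        int lineNumber = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (line.Length == 0) continue;                 // tolerate blank lines
            switch (line[0])
            {
                case '1': extract.FileHeader = line; break;
                case '3':
                    current = new ParsedBatch();
                    current.Header = line;
                    extract.Batches.Add(current);
                    break;
                case '5':
                    if (current == null) throw new InvalidDataException("Transaction outside a batch at line " + lineNumber);
                    current.Transactions.Add(line);
                    break;
                case '7':
                    if (current == null) throw new InvalidDataException("Batch trailer without a header at line " + lineNumber);
                    current.Trailer = line;
                    current = null;                         // batch is closed
                    break;
                case '9': extract.FileTrailer = line; break;
                default:
                    throw new InvalidDataException("Unknown record type '" + line[0] + "' at line " + lineNumber);
            }
        }
        return extract;
    }

    // Fixed-width slice: start is zero-based, trailing padding is trimmed.
    public static string Field(string line, int start, int length)
    {
        return line.Substring(start, length).TrimEnd();
    }
}
```

Because all five record types are handled in one pass, the file never has to be read once per spec.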

or

You could also buffer the data from a Stream manually and use IndexOf + Substring for more performance (if done right).

Also, if the lines held primitive data types in binary format rather than text columns, you could use the BinaryReader class for a very easy and performant way to read the objects.
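
One way to read the manual-buffering idea is sketched below, assuming the file (or a large chunk of it) has already been loaded into a single string. The `BufferScanner` class and its `RecordHandler` delegate are hypothetical names for this example. Line boundaries are found with IndexOf, and fields are cut straight out of the buffer with Substring, so no intermediate per-line string is created. Whether this actually beats ReadLine should be measured on real files.

```csharp
using System;

public static class BufferScanner
{
    // Called once per record with the line's position inside the buffer.
    public delegate void RecordHandler(string buffer, int lineStart, int lineLength);

    // Walks the buffer using IndexOf to find each line break.
    public static void Scan(string buffer, RecordHandler handler)
    {
        int pos = 0;
        while (pos < buffer.Length)
        {
            int end = buffer.IndexOf('\n', pos);
            if (end < 0) end = buffer.Length;                     // last line has no newline
            int length = end - pos;
            if (length > 0 && buffer[end - 1] == '\r') length--;  // drop the CR of a CRLF pair
            if (length > 0) handler(buffer, pos, length);         // skip empty lines
            pos = end + 1;
        }
    }

    // Slices one field out of the buffer; offset is relative to the line start.
    public static string Field(string buffer, int lineStart, int offset, int length)
    {
        return buffer.Substring(lineStart + offset, length).TrimEnd();
    }
}
```

Inside the handler, `buffer[lineStart]` is the record type character, so the same switch-on-first-character dispatch applies.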

清引 2024-09-14 18:03:05

One critique I have is that you are not correctly implementing ToString.

    public string ToString()

Should be:

    public override string ToString()
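
To see why this matters, here is a small sketch with two hypothetical classes. Without `override`, the method only hides Object.ToString, so any code that holds the record as `object` (string concatenation, String.Format, Console.WriteLine) still gets the default implementation, which returns the type name. The `new` modifier is used below only to silence the compiler's hiding warning; the behavior matches the declaration in the question.

```csharp
using System;

public class HidingHeader
{
    // Hides Object.ToString instead of overriding it.
    public new string ToString() { return "1HEADER"; }
}

public class OverridingHeader
{
    // Replaces Object.ToString for every caller, including base-typed ones.
    public override string ToString() { return "1HEADER"; }
}

public static class ToStringDemo
{
    // Calls ToString through an object reference, as framework code does.
    public static string Describe(object record)
    {
        return record.ToString();
    }
}
```

`ToStringDemo.Describe(new OverridingHeader())` returns the record text, while `ToStringDemo.Describe(new HidingHeader())` falls back to the type name.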