Processing and merging two large files

Posted 2024-11-15 17:14:42 · 504 words · 5 views · 0 comments

I need to read in two large files (over 125 MB). Each file contains records with similar data. I need to find the records that appear in both files and, where the fields of those records don't match, overwrite the fields in file two with the values from file one.

For example, the first file has the following fields:

ID, ACCT, Bal, Int, Rate

The second file has the following fields:

TYPE, ID, ACCT, Bal, Int, Rate

So if a record in file 1 has the same ACCT number as a record in file 2, then the Bal, Int, and Rate in file 2 need to be overwritten with the values of Bal, Int, and Rate from file 1.

Some records won't appear in both files. The output file should contain every record from file two: a record that is not in file one is written as is, and a record that is also in file one is written with its updated values.

I have tried many different options, but most are not efficient enough to deal with files this large. What is the proper direction to take with this problem? Thanks in advance for any help.

Comments (2)

夜空下最亮的亮点 2024-11-22 17:14:42

Define two type-specific classes, one for each file.

class FileOne
{
    public int LineNumber { get; set; }
    public int Id { get; set; }
    public double Bal { get; set; }
    // ...
}

class FileTwo
{
    public int LineNumber { get; set; }
    public string TranType { get; set; }  // the TYPE column
    public int Id { get; set; }
    public double Bal { get; set; }
    // ...
}

Load each file into an IList<> so that you have an IList<FileOne> myFileOne and an IList<FileTwo> myFileTwo, and capture the line number of each entry so you know where it appears in its file.
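A minimal sketch of that loading step for file one. It assumes a comma-separated file with no header row and no quoted fields, which the question does not confirm. FileOneRecord and FileOneLoader are hypothetical names used for illustration; the record stands in for the FileOne class above with an assumed field list.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical stand-in for FileOne; the exact field list is an assumption.
public class FileOneRecord
{
    public int LineNumber { get; set; }
    public int Id { get; set; }
    public string Acct { get; set; } = "";
    public double Bal { get; set; }
    public double Int { get; set; }
    public double Rate { get; set; }
}

public static class FileOneLoader
{
    // Assumed layout per line: ID,ACCT,Bal,Int,Rate
    public static IList<FileOneRecord> Load(TextReader reader)
    {
        var records = new List<FileOneRecord>();
        int lineNumber = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;                            // 1-based position in the source file
            if (line.Trim().Length == 0) continue;   // skip blank lines but keep counting
            string[] f = line.Split(',');
            records.Add(new FileOneRecord
            {
                LineNumber = lineNumber,
                Id = int.Parse(f[0].Trim(), CultureInfo.InvariantCulture),
                Acct = f[1].Trim(),
                Bal = double.Parse(f[2].Trim(), CultureInfo.InvariantCulture),
                Int = double.Parse(f[3].Trim(), CultureInfo.InvariantCulture),
                Rate = double.Parse(f[4].Trim(), CultureInfo.InvariantCulture),
            });
        }
        return records;
    }
}
```

Passing a StreamReader opened on the real file reads it one line at a time, although the resulting list still holds every record in memory. A loader for file two would look the same with the extra leading TYPE column.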

Now use LINQ to find the records whose values differ between the two:

var diffs = from f1 in myFileOne
            join f2 in myFileTwo on f1.Id equals f2.Id // the question matches on ACCT; join on that field if it is the shared key
            where f1.Bal != f2.Bal // add whatever conditions you need here
            select new {
                f2.Id, f1.Bal, f1.Int, f1.Rate, f2.LineNumber
            };

diffs is an enumerable collection of anonymous objects holding the five fields in the select. Since the question asks for file two to be overwritten with file one's values, iterate through diffs, use LineNumber to find the matching entry in myFileTwo, and update it with the Bal, Int, and Rate taken from file one.

Does that help, or were you more interested in how to access the file itself?

再浓的妆也掩不了殇 2024-11-22 17:14:42

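Load all records from file 1 into a hash table with ACCT as the key.
Loop over all records in file 2, look each one up by ACCT, and update it if needed.

Complexity: O(n) in the total number of records, assuming constant-time hash lookups.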
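The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes comma-separated lines with no header row and no quoted fields, and AccountMerger and Merge are hypothetical names. File 1 is indexed in memory, while file 2 is streamed line by line and never held in full.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class AccountMerger
{
    // Assumed layouts (not confirmed by the question):
    //   file 1: ID,ACCT,Bal,Int,Rate
    //   file 2: TYPE,ID,ACCT,Bal,Int,Rate
    public static void Merge(TextReader file1, TextReader file2, TextWriter output)
    {
        // Pass 1: index file 1 by ACCT, keeping only Bal, Int, and Rate.
        // If an ACCT repeats in file 1, the last occurrence wins.
        var byAcct = new Dictionary<string, string[]>();
        string line;
        while ((line = file1.ReadLine()) != null)
        {
            string[] f = line.Split(',');
            if (f.Length < 5) continue;              // skip blank or malformed lines
            byAcct[f[1].Trim()] = new[] { f[2].Trim(), f[3].Trim(), f[4].Trim() };
        }

        // Pass 2: stream file 2 and write every record to the output.
        while ((line = file2.ReadLine()) != null)
        {
            string[] f = line.Split(',');
            if (f.Length >= 6 && byAcct.TryGetValue(f[2].Trim(), out string[] src))
            {
                // ACCT found in file 1: overwrite Bal, Int, and Rate.
                f[3] = src[0];
                f[4] = src[1];
                f[5] = src[2];
                output.WriteLine(string.Join(",", f));
            }
            else
            {
                output.WriteLine(line);              // not in file 1: write as is
            }
        }
    }
}
```

Called with StreamReader and StreamWriter instances over the real files, this makes one pass over each input. Memory use is driven by the size of file 1's index, so if file 1 is the larger file and memory is tight, sorting both files by ACCT and merging them sequentially is an alternative worth considering.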

HTH
