Processing and merging two large files
I need to read in two large files (over 125 MB). Each file contains records with similar data. I need to find the records that appear in both files and, if their fields don't match, overwrite the record in file two with the field values from the matching record in file one.
For example, the first file has the following fields:
ID, ACCT, Bal, Int, Rate
The second file has the following fields:
TYPE, ID, ACCT, Bal, Int, Rate
So if a record in file 1 has the same ACCT number as a record in file 2, then the Bal, Int, and Rate in file 2 need to be overwritten with the values of Bal, Int, and Rate from file 1.
Some records won't appear in both files. The output file must contain every record from file two: a record that has no match in file one is written as is, and a record that does have a match is written with its updated values.
I have tried many different options but most are not efficient enough to deal with the large files. What is the proper direction to take with this problem? Thanks in advance for any help.
Comments (2)
Define two type-specific classes, one for each file.
Load each file into an IList<> so that you have two lists, myFileOne and myFileTwo, and capture the line number of each entry so you know where it appears in its file.
Now use LINQ to query the differences between the two, joining on ACCT and selecting the entries whose Bal, Int, or Rate differ.
Diffs will be an enumerable collection of the 4 fields in the select. You can then iterate through it and, using the captured line number, find the matching entry in myFileTwo and overwrite it with the values from file one, which is the direction the question asks for.
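The query itself did not survive in this post, so here is only a rough sketch of what such a join might look like. The record types, their property names and types, and the `ApplyDiffs` helper are all assumptions made for illustration, and `LineNum` is assumed to be the zero-based position of the record in its list. Following the question, the values from file one overwrite those in file two.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record types; names and property types are assumptions.
public record FileOneRecord(int LineNum, string Id, string Acct, decimal Bal, decimal Int, decimal Rate);
public record FileTwoRecord(int LineNum, string Type, string Id, string Acct, decimal Bal, decimal Int, decimal Rate);

public static class DiffQuery
{
    // Returns a copy of file two's records in which Bal, Int and Rate are
    // overwritten wherever file one has the same ACCT with different values.
    public static List<FileTwoRecord> ApplyDiffs(
        IList<FileOneRecord> myFileOne, IList<FileTwoRecord> myFileTwo)
    {
        // Join on ACCT and keep only the pairs whose values differ.
        var diffs = from f1 in myFileOne
                    join f2 in myFileTwo on f1.Acct equals f2.Acct
                    where f1.Bal != f2.Bal || f1.Int != f2.Int || f1.Rate != f2.Rate
                    select new { f2.LineNum, f1.Bal, f1.Int, f1.Rate };

        var result = myFileTwo.ToList();
        foreach (var d in diffs)
        {
            // LineNum is assumed to be the zero-based index into myFileTwo.
            result[d.LineNum] = result[d.LineNum] with { Bal = d.Bal, Int = d.Int, Rate = d.Rate };
        }
        return result;
    }
}
```

Records that have no partner in file one never show up in `diffs`, so they pass through unchanged. Note that this approach holds both files in memory at once, which may or may not be acceptable for files of this size.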
Does that help or were you more interested in how to access the file itself?
Load all records from file 1 into a hash table with ACCT as key.
Loop over all records in file 2 and update if needed.
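Complexity: O(n) expected time, where n is the total number of records in the two files, since each hash table lookup takes constant time on average.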
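As a minimal sketch of those two steps, assuming comma-separated files with no header row and no quoted commas (the question does not state the actual file format), something like the following could work. The `AccountMerger` class and `Merge` method are hypothetical names. Only file 1 is held in memory; file 2 is streamed line by line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class AccountMerger
{
    // Assumed layouts (comma-separated, no header, no quoted commas):
    //   file 1: ID,ACCT,Bal,Int,Rate
    //   file 2: TYPE,ID,ACCT,Bal,Int,Rate
    public static void Merge(string fileOnePath, string fileTwoPath, string outputPath)
    {
        // Step 1: index file 1 by ACCT, keeping only Bal, Int and Rate.
        var byAcct = new Dictionary<string, string[]>();
        foreach (var line in File.ReadLines(fileOnePath))
        {
            var f = line.Split(',');
            if (f.Length < 5) continue; // skip blank or malformed lines
            byAcct[f[1].Trim()] = new[] { f[2].Trim(), f[3].Trim(), f[4].Trim() };
        }

        // Step 2: stream file 2 and write every record to the output.
        using var writer = new StreamWriter(outputPath);
        foreach (var line in File.ReadLines(fileTwoPath))
        {
            var f = line.Split(',');
            // Values are compared as text, so "1.0" and "1.00" count as different.
            if (f.Length >= 6 && byAcct.TryGetValue(f[2].Trim(), out var v)
                && (f[3].Trim() != v[0] || f[4].Trim() != v[1] || f[5].Trim() != v[2]))
            {
                f[3] = v[0]; f[4] = v[1]; f[5] = v[2];
                writer.WriteLine(string.Join(",", f));
            }
            else
            {
                writer.WriteLine(line); // no match, or already equal: copy as is
            }
        }
    }
}
```

A call such as `AccountMerger.Merge("file1.csv", "file2.csv", "merged.csv")` (placeholder paths) would produce the output file in one pass over each input. If several records in file 1 share an ACCT, the last one read wins in this sketch.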
HTH