Efficient pattern for processing fixed-width files
I have a case where I need to read a flat file with close to 100,000 logical records. Each logical record is comprised of n x 128-character parts, i.e. Type A: 3 x 128, Type B: 4-5 x 128, etc., where the maximum possible n is 6.
The application has to read the file and process the records. The problem is that 'n' can be determined only after reading the first 52 characters of each n x 128 partition.
Could you please suggest any design patterns I can re-use, or any efficient algorithms to perform this?
Note: 1. Performance is an important criterion, as the application needs to process thousands of files like this every day. 2. The data is not separated by lines; it is one long, string-like stream.
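For illustration, a minimal sequential reader for this layout might look like the sketch below (Java is assumed; `determineN(...)` is a hypothetical placeholder for whatever rule the first 52 characters actually encode):

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class SequentialRecordReader {

    private static final int PART_LEN = 128;   // each logical record is n x 128 characters
    private static final int HEADER_LEN = 52;  // 'n' is encoded somewhere in the first 52 characters
    private static final int MAX_N = 6;

    public static void main(String[] args) throws IOException {
        try (Reader in = new BufferedReader(new FileReader(args[0]), 1 << 16)) {
            char[] record = new char[MAX_N * PART_LEN];
            // Read the first 128-character part; a clean EOF here means the file is done.
            while (readFully(in, record, 0, PART_LEN)) {
                int n = determineN(new String(record, 0, HEADER_LEN));
                if (n > 1) {
                    readFully(in, record, PART_LEN, (n - 1) * PART_LEN);
                }
                process(new String(record, 0, n * PART_LEN));
            }
        }
    }

    // Read exactly len chars into buf at offset off; returns false only on a clean EOF
    // at a record boundary, and throws if a record is truncated mid-way.
    private static boolean readFully(Reader in, char[] buf, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int r = in.read(buf, off + total, len - total);
            if (r < 0) {
                if (total == 0 && off == 0) return false;
                throw new EOFException("Truncated record");
            }
            total += r;
        }
        return true;
    }

    // Placeholder: the real mapping from the 52-character header to n (1..6) is record-type specific.
    private static int determineN(String header) {
        return header.charAt(0) == 'A' ? 3 : 4;   // assumed rule, for illustration only
    }

    private static void process(String record) {
        // application-specific processing
    }
}
```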
Comments (2)
You could adopt a master-worker (or master-slave) pattern, in which a master thread is responsible for reading the first 52 characters of each record to determine its length. The master can then defer the actual work of reading and processing the record to a worker thread, and move on to the next record to again read only the first 52 characters. Each worker would be responsible for (re)opening the file and processing a particular range of characters; the worker needs to be provided with this information.
Since I haven't seen the structure of the file, I can only post a few possible limitations or concerns for an implementer to think about.
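A minimal sketch of that idea, assuming Java, a single-byte encoding (so character offsets equal byte offsets), and a hypothetical `determineN(...)` that decodes the record length from the 52-character header:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MasterWorkerReader {

    private static final int PART_LEN = 128;
    private static final int HEADER_LEN = 52;

    public static void main(String[] args) throws IOException, InterruptedException {
        String path = args[0];
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Master: skim the file, reading only each record's 52-character header.
        try (RandomAccessFile master = new RandomAccessFile(path, "r")) {
            long offset = 0, fileLen = master.length();
            byte[] header = new byte[HEADER_LEN];
            while (offset < fileLen) {
                master.seek(offset);
                master.readFully(header);
                int n = determineN(new String(header, StandardCharsets.US_ASCII));
                long recordStart = offset;
                int recordLen = n * PART_LEN;
                // Worker: handed the range, it reopens the file and processes just that record.
                workers.submit(() -> processRange(path, recordStart, recordLen));
                offset += recordLen;
            }
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void processRange(String path, long start, int len) {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            byte[] record = new byte[len];
            in.seek(start);
            in.readFully(record);
            // application-specific processing of the record goes here
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Placeholder: decode the record type from the header; rule assumed for illustration only.
    private static int determineN(String header) {
        return header.charAt(0) == 'A' ? 3 : 4;
    }
}
```

Submitting one task per record keeps the sketch simple; in practice, batching a range of records per task would cut scheduling overhead, and the master and workers will still compete for the same disk, which is worth measuring before committing to this design.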
Unless you can change the format, you have to work around it.
You can create an index for each file, but you would have to read the file once to build the index (though that would save having to do it more than once).
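A sketch of how such an index could be built in one sequential pass (again assuming Java, a single-byte encoding, and a placeholder `determineN(...)`):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RecordIndexBuilder {

    private static final int PART_LEN = 128;
    private static final int HEADER_LEN = 52;

    // One pass over the file: record where each logical record starts, so later
    // passes (or parallel workers) can seek straight to any record without
    // re-reading all the headers before it.
    public static List<Long> buildIndex(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            long offset = 0, fileLen = in.length();
            byte[] header = new byte[HEADER_LEN];
            while (offset < fileLen) {
                offsets.add(offset);
                in.seek(offset);
                in.readFully(header);
                int n = determineN(new String(header, StandardCharsets.US_ASCII));
                offset += (long) n * PART_LEN;   // skip ahead to the next logical record
            }
        }
        return offsets;
    }

    // Placeholder: assumed mapping from the 52-character header to n, for illustration only.
    private static int determineN(String header) {
        return header.charAt(0) == 'A' ? 3 : 4;
    }
}
```

The offset list can be kept in memory or persisted next to the file, so that later runs, or the workers from the previous answer, can seek directly to any record they are assigned.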