An efficient pattern for processing fixed-width files

Posted 2024-11-16 19:45:44

I have a case wherein I need to read a flat file with close to 100,000 logical records. Each logical record is composed of n x 128-character parts, i.e. Type A: 3 x 128, Type B: 4-5 x 128, and so on, where the maximum possible n is 6.

The application has to read the file and process the records. The problem is that n can be determined only after reading the first 52 characters of each n x 128 partition.

Could you please suggest any design patterns I can reuse, or any efficient algorithms, to perform this?

Note: 1. Performance is an important criterion, as the application needs to process thousands of files like this every day. 2. The data is not separated into lines; it is one long, continuous string.
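Since no code was posted, here is a minimal sequential sketch of the core problem in Java, assuming single-byte characters (so 128 characters = 128 bytes) and a hypothetical determineN() standing in for the unspecified mapping from the 52-character header to n:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Minimal sequential sketch: read the 52-byte header of each
 * n x 128 record, derive n from it, then consume the rest of the
 * record. determineN() is hypothetical -- the header-to-n mapping
 * is not specified in the question.
 */
public class FixedWidthReader {
    static final int PART = 128;   // one part of a logical record
    static final int HEADER = 52;  // bytes needed to determine n

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            byte[] header = new byte[HEADER];
            while (readFully(in, header, HEADER)) {
                int n = determineN(header);               // 1..6, per record type
                byte[] rest = new byte[n * PART - HEADER];
                if (!readFully(in, rest, rest.length)) {
                    throw new IOException("Truncated record");
                }
                process(header, rest);
            }
        }
    }

    // Reads exactly len bytes, or returns false on a clean EOF at a record boundary.
    static boolean readFully(InputStream in, byte[] buf, int len) throws IOException {
        int off = 0;
        while (off < len) {
            int r = in.read(buf, off, len - off);
            if (r < 0) {
                if (off == 0) return false;               // clean EOF
                throw new IOException("Truncated record");
            }
            off += r;
        }
        return true;
    }

    static int determineN(byte[] header) {
        // Hypothetical: replace with the real type -> n mapping (max n is 6).
        return 3;
    }

    static void process(byte[] header, byte[] rest) {
        // Record-specific processing goes here.
    }
}
```

The BufferedInputStream keeps the per-record reads from turning into thousands of small syscalls; the answers below discuss parallelizing the processing step.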


Comments (2)

墟烟 2024-11-23 19:45:44

You could adopt a master-worker (or master-slave) pattern, wherein a master thread is responsible for reading the first 52 characters of data to determine the length of the record. The master then defers the actual work of reading and processing the record to a worker thread, and moves on to the next record, again reading only the first 52 characters. Each worker is responsible for (re)opening the file and processing a particular range of characters; the worker needs to be provided with this information. (A sketch follows the list below.)

Since I haven't seen the structure of the file, I can only post a few possible limitations or concerns for an implementer to think about:

  • An effective and performant implementation relies on being able to hand a worker thread a file offset and the length of the data that the worker should deal with. In simpler words, the worker thread should actually read the file in random-access mode, instead of having the master do the reading (which is serial). If you cannot perform random access, there isn't a lot you can do to optimize the master-worker pattern.
  • Spawning a new worker thread per record is not recommended. Use a thread pool. This also means you can bound the number of open file descriptors by the size of the pool.
  • Queue up further requests to process character ranges in case the pool is exhausted. That way, the master can continue doing its work until the last record has been read.
  • Dependencies across records would require you to serialize their processing. If each record can be processed on its own thread, without requiring results from other threads to be made available, you should not encounter any difficulty in adopting this approach.
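A minimal sketch of this idea in Java, under the same assumptions as before (single-byte characters, hypothetical determineN()). One deliberate deviation from the answer: rather than each worker (re)opening the file, all workers share a single FileChannel, whose positional read(ByteBuffer, long) is documented as safe for concurrent use:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Master-worker sketch: the master scans only the 52-byte headers to
 * find record boundaries; workers re-read their (offset, length) slice
 * with positional, thread-safe FileChannel reads.
 */
public class MasterWorker {
    static final int PART = 128, HEADER = 52;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
            long offset = 0;
            long size = ch.size();
            ByteBuffer header = ByteBuffer.allocate(HEADER);
            while (offset < size) {
                header.clear();
                ch.read(header, offset);               // positional read; sketch assumes the
                                                       // full header arrives in one read
                int n = determineN(header.array());    // hypothetical type -> n mapping (1..6)
                final long recOffset = offset;
                final int recLen = n * PART;
                pool.submit(() -> processSlice(ch, recOffset, recLen));
                offset += recLen;                      // skip to the next record's header
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);  // let workers finish before the channel closes
        }
    }

    static void processSlice(FileChannel ch, long offset, int len) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(len);
            ch.read(buf, offset);                      // random access; no shared file position
            buf.flip();
            // Record-specific processing of buf goes here.
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    static int determineN(byte[] header) {
        return 3;                                      // placeholder; derive from the 52-byte header
    }
}
```

Note that Executors.newFixedThreadPool is backed by an unbounded work queue, so the queueing behaviour described in the third bullet comes for free; if memory is a concern, a ThreadPoolExecutor with a bounded queue and a CallerRunsPolicy would add back-pressure instead.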
瞄了个咪的 2024-11-23 19:45:44

Unless you can change the format, you have to work around it.

You can create an index for each file, but you would have to read each file once to build its index (though that saves having to do the full scan more than once).
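A sketch of such an index in Java, again assuming single-byte characters and the hypothetical determineN(): one cheap pass records the byte offset of every logical record, reading only the 52-byte headers and skipping the rest, after which any later pass can seek straight to a record:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Index-building sketch: one sequential pass over the headers records
 * each logical record's byte offset so later passes (or other threads)
 * can seek directly. Only 52 bytes per record are actually read; the
 * remainder is skipped. determineN() is hypothetical, as before.
 */
public class RecordIndexer {
    static final int PART = 128, HEADER = 52;

    public static List<Long> buildIndex(String file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
            byte[] header = new byte[HEADER];
            long pos = 0;
            int read;
            while ((read = in.readNBytes(header, 0, HEADER)) == HEADER) {
                offsets.add(pos);                    // start of this logical record
                long recLen = (long) determineN(header) * PART;
                skipFully(in, recLen - HEADER);      // jump to the next header
                pos += recLen;
            }
            if (read != 0) throw new IOException("Truncated record at offset " + pos);
        }
        return offsets;
    }

    static void skipFully(BufferedInputStream in, long n) throws IOException {
        while (n > 0) {
            long s = in.skip(n);
            if (s <= 0) throw new IOException("Unexpected EOF while skipping");
            n -= s;
        }
    }

    static int determineN(byte[] header) {
        return 3;                                    // placeholder type -> n mapping
    }
}
```

The resulting List<Long> could be persisted alongside the data file so the boundary scan happens only once per file.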
