Reading large input files (10GB) with a Java program
I am working with 2 large input files of about 5 GB each.
They are the output of a Hadoop MapReduce job, but since I am not able to do the dependency calculations in MapReduce, I am switching to an optimized for loop for the final calculations (see my previous question on the MapReduce design: Recursive calculations using Mapreduce).
I would like suggestions on reading such huge files in Java and doing some basic operations; finally, I will be writing out data of the order of around 5 GB.
I appreciate your help.
3 Answers
If the files have the properties you described, i.e. 100 integer values per key and 10GB each, you are talking about a very large number of keys, far more than you can feasibly fit into memory. If you can sort the files before processing, for example using the OS sort utility or a MapReduce job with a single reducer, you can read the two files simultaneously, do your processing, and output the result without keeping much data in memory.
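A rough sketch of that merge-style pass, assuming both files are already sorted by key and each line is formatted as key<TAB>values (the file names and the combine() step are placeholders):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedMergeJoin {
    public static void main(String[] args) throws IOException {
        // Assumes both inputs are sorted by key and each line looks like "key\tvalue1,value2,..."
        try (BufferedReader a = Files.newBufferedReader(Paths.get("sorted-a.txt"));
             BufferedReader b = Files.newBufferedReader(Paths.get("sorted-b.txt"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("result.txt"))) {

            String lineA = a.readLine();
            String lineB = b.readLine();
            while (lineA != null && lineB != null) {
                String keyA = lineA.substring(0, lineA.indexOf('\t'));
                String keyB = lineB.substring(0, lineB.indexOf('\t'));
                int cmp = keyA.compareTo(keyB);
                if (cmp == 0) {
                    // Matching key in both files: combine the two records and write the result.
                    out.write(keyA + "\t" + combine(lineA, lineB));
                    out.newLine();
                    lineA = a.readLine();
                    lineB = b.readLine();
                } else if (cmp < 0) {
                    lineA = a.readLine();   // key only in file A; skip (or handle) and advance A
                } else {
                    lineB = b.readLine();   // key only in file B; skip (or handle) and advance B
                }
            }
            // Records left in only one file after the loop are ignored in this sketch.
        }
    }

    // Placeholder for the actual per-key calculation.
    private static String combine(String recordA, String recordB) {
        return recordA.substring(recordA.indexOf('\t') + 1)
             + "|" + recordB.substring(recordB.indexOf('\t') + 1);
    }
}
```

This keeps only one record from each file in memory at a time, which is why the pre-sorting matters.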
It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file and then, in a loop, read each record, process it, and write out the result, as sketched below.
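A minimal sketch of that loop, assuming the data is line-oriented (the file names and process() are placeholders):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamProcess {
    public static void main(String[] args) throws IOException {
        // Stream the big file line by line; only one line is held in memory at a time.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("output.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String result = process(line);   // whatever basic operation is needed per record
                out.write(result);
                out.newLine();
            }
        }
    }

    private static String process(String line) {
        return line;   // placeholder for the real calculation
    }
}
```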
If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
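A minimal sketch of the H2 idea, assuming the H2 jar is on the classpath; the database name, table name, and schema here are only illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class H2ResultStore {
    public static void main(String[] args) throws SQLException {
        // File-based H2 database: results are spilled to disk instead of kept in memory.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./resultdb", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS results(k VARCHAR(255) PRIMARY KEY, v VARCHAR(4000))");
            }
            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO results(k, v) VALUES (?, ?)")) {
                // In the real loop this would be called once per processed record.
                insert.setString(1, "someKey");
                insert.setString(2, "someValue");
                insert.executeUpdate();
            }
        }
    }
}
```

For millions of rows it would probably be worth batching the inserts (addBatch()/executeBatch()) instead of issuing them one at a time.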
My approach:
I configured the MapReduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order in both input files.
Now at every stage I read 2 input files (around 600 MB) and did the processing, so at every stage I had to hold about 600 MB in memory, which the system could manage pretty well.
The program was pretty quick; it took around 20 minutes for the complete processing.
Thanks for all the suggestions, I appreciate your help!
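Roughly, that pairing loop could look like the sketch below, assuming the two jobs' outputs sit in directories job-a and job-b and that the sorted part files line up record for record; those names, the one-to-one pairing, and process() are placeholders rather than the exact original program:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class PartFilePairs {
    public static void main(String[] args) throws IOException {
        // One pass per reducer output pair; only two part files (~600 MB) are in memory at a time.
        for (int i = 0; i < 16; i++) {
            String part = String.format("part-%05d", i);
            List<String> recordsA = Files.readAllLines(Paths.get("job-a", part));
            List<String> recordsB = Files.readAllLines(Paths.get("job-b", part));
            // Keys are sorted identically in both outputs, so the records are paired positionally here.
            for (int j = 0; j < Math.min(recordsA.size(), recordsB.size()); j++) {
                process(recordsA.get(j), recordsB.get(j));
            }
        }
    }

    private static void process(String recordA, String recordB) {
        // Placeholder for the dependency calculation on the paired records.
    }
}
```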