Reading large input files (10 GB) with a Java program

Posted 2024-11-25 08:26:39


I am working with two large input files, each about 5 GB in size.
They are the output of a Hadoop MapReduce job, but since I am not able to do the dependency calculations in MapReduce, I am switching to an optimized for loop for the final calculations (see my previous question on MapReduce design: Recursive calculations using Mapreduce).

I would like suggestions on reading such huge files in Java and performing some basic operations; finally, I will be writing out data on the order of 5 GB.

I appreciate your help.


Comments (3)

浪推晚风 2024-12-02 08:26:39


If the files have the properties you described, i.e. 100 integer values per key and 10 GB each, you are talking about a very large number of keys, far more than you can feasibly fit in memory. If you can sort the files before processing, for example with the OS sort utility or a MapReduce job with a single reducer, you can read both files simultaneously, do your processing, and output the result without keeping much data in memory.
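A minimal sketch of that merge-style read, assuming each line is a tab-separated `key<TAB>values` record and both files are already sorted by key; the file names and the `process` helper are placeholders:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedMergeReader {

    public static void main(String[] args) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(Paths.get("fileA.txt"), StandardCharsets.UTF_8);
             BufferedReader b = Files.newBufferedReader(Paths.get("fileB.txt"), StandardCharsets.UTF_8)) {

            String lineA = a.readLine();
            String lineB = b.readLine();

            // Merge-style walk: advance whichever file currently has the smaller key,
            // so only one record per file is ever held in memory at a time.
            while (lineA != null && lineB != null) {
                String keyA = lineA.split("\t", 2)[0];
                String keyB = lineB.split("\t", 2)[0];
                int cmp = keyA.compareTo(keyB);

                if (cmp == 0) {
                    process(lineA, lineB);    // key present in both files
                    lineA = a.readLine();
                    lineB = b.readLine();
                } else if (cmp < 0) {
                    lineA = a.readLine();     // key only in file A, skip or handle
                } else {
                    lineB = b.readLine();     // key only in file B, skip or handle
                }
            }
        }
    }

    // Placeholder for the actual per-key computation and output.
    private static void process(String recordA, String recordB) {
    }
}
```

The same idea works regardless of file size, because memory usage stays constant at one record per input file.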

自此以后,行同陌路 2024-12-02 08:26:39


It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop:

  1. Read in one piece of your data
  2. Process the piece of data
  3. Store the result: in memory if you'll have room for the complete dataset, in a database of some sort if not

If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
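A minimal sketch of that loop, assuming line-oriented input; the file names and the `transform` helper are placeholders, and the file writer could be swapped for JDBC inserts into an H2 file database if the results outgrow memory:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamProcess {

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("output.txt"), StandardCharsets.UTF_8)) {

            String line;
            while ((line = in.readLine()) != null) {   // 1. read in one piece of data
                String result = transform(line);       // 2. process that piece
                out.write(result);                     // 3. store the result
                out.newLine();
            }
        }
    }

    // Placeholder for the actual per-record computation.
    private static String transform(String line) {
        return line;
    }
}
```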

2024-12-02 08:26:39


My approach:

I configured the MapReduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order in both input files.

Now at every stage I read 2 input files (around 600 MB) and did the processing. So at every stage I only had to hold about 600 MB in memory, which the system could manage pretty well.

The program was pretty quick, taking around 20 minutes for the complete processing.

Thanks for all the suggestions! I appreciate your help.
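A hypothetical driver illustrating that staging, assuming the two jobs wrote their part files into `jobA/` and `jobB/` with the same partitioner, so matching part numbers share the same keys; `processPair` stands in for the actual per-stage computation:

```java
import java.io.IOException;

public class PartwiseDriver {

    public static void main(String[] args) throws IOException {
        for (int i = 0; i < 16; i++) {
            String part = String.format("part-%05d", i);
            // Each stage only touches one ~300 MB file from each job (~600 MB total).
            processPair("jobA/" + part, "jobB/" + part);
        }
    }

    private static void processPair(String fileA, String fileB) throws IOException {
        // Read the two part files (e.g. with a merge-style reader as sketched above),
        // run the dependency calculation, and write out this stage's results.
    }
}
```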
