How to handle daily increasing data in Hadoop
In Hadoop, how do I handle data that increases every day?
For example:
On day 1, I may have 1 million files in some input folder (e.g. hadoop/demo).
On day 2, another 1 million new files may be added to the same folder, so it then holds 2 million files in total.
Likewise on day 3, day 4, and so on.
My constraint is: day 1's files should not be processed again on the next day.
That is, files that have already been processed must not be processed again when new files are added alongside them. More specifically, only the newly added files should be processed, and the older files should be ignored.
Please help me find a way to solve this issue.
If the constraint is still unclear, kindly point out which part, so that I can elaborate on it.
Comments (1)
It is not something supported by Hadoop itself, since it is part of the application logic.
I would suggest an HDFS-based solution: keep a directory (or, better, a hierarchy of directories with a subdirectory for each day, e.g. /data/incoming/<year>/<month>/<day>) containing only the data that has not yet been processed.
Your daily job should take all the data there, process it, and move it to a "processed" folder.
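As a minimal sketch of that pattern using the HDFS FileSystem API (the paths /data/incoming and /data/processed are made up for illustration, and processFile() is a placeholder for your real job logic):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DailyIngest {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());

            Path incoming  = new Path("/data/incoming");   // data not yet processed
            Path processed = new Path("/data/processed");  // data already handled
            fs.mkdirs(processed);

            // Snapshot of what sits in the incoming directory right now;
            // files that arrive later simply wait for the next run.
            for (FileStatus status : fs.listStatus(incoming)) {
                if (status.isDirectory()) {
                    continue;  // this flat sketch ignores subdirectories
                }
                Path file = status.getPath();

                processFile(fs, file);  // stand-in for the real processing

                // Move the file out of incoming so the next run never sees it again.
                fs.rename(file, new Path(processed, file.getName()));
            }
        }

        private static void processFile(FileSystem fs, Path file) throws IOException {
            // real application logic goes here
        }
    }

Because each file is moved only after it has been processed, anything left in /data/incoming is, by construction, still unprocessed.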
A trade-off that usually makes sense is to write the logic so that accidental double processing of some file will not cause problems. In that case, a job that crashes after processing a file but before moving it causes no problem either: the file is simply picked up again on the next run.
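Along those lines, one hedged way to fill in the processFile() placeholder from the sketch above is to derive the output path purely from the input name and overwrite it, so that reprocessing a file merely rewrites the same result (the /data/output path is again made up; the method also needs an import of org.apache.hadoop.fs.FSDataOutputStream):

    // Idempotent per-file processing: the output location is a pure function
    // of the input name, and overwrite=true means an accidental second run
    // rewrites the same result instead of producing a duplicate.
    private static void processFile(FileSystem fs, Path file) throws IOException {
        Path output = new Path("/data/output", file.getName() + ".out");
        try (FSDataOutputStream out = fs.create(output, true /* overwrite */)) {
            // write the processed result for this single input file
        }
    }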
Instead of daily scheduling, you might use a workflow tool such as Oozie, which can trigger jobs based on data availability, although I personally haven't used them yet.