Caching of Map applications in Hadoop MapReduce?
Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table, and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always give the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing, and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I gather from the documentation (especially the Cloudera videos) that, for the class of problem Hadoop is used for, re-calculation (of potentially redundant data) can be quicker than persisting and retrieving.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
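For what it's worth, here is a minimal, untested sketch of such a map-only job using the newer Hadoop Java API. DocumentMapper is a hypothetical stand-in for your real Map function; the key line is setNumReduceTasks(0).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

    // Hypothetical stand-in for the real Map function; here it just
    // passes each input line through unchanged.
    public static class DocumentMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(key.toString()), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only pass");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(DocumentMapper.class);

        // The key line: zero reducers makes this a map-only job, so the
        // mapper output is written straight to the output folder and can
        // be reused by later jobs instead of being recomputed.
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}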
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
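A rough sketch of wiring up those dated folders, reusing the folder names above. The glob on the input side is what lets a single reduce pass cover every batch produced so far:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DatedFolders {
    // Each incremental map-only run writes to its own dated subfolder,
    // e.g. my_mapper_output/20091108.
    static void configureMapRun(Job mapJob) {
        String stamp = new SimpleDateFormat("yyyyMMdd").format(new Date());
        FileOutputFormat.setOutputPath(mapJob,
                new Path("my_mapper_output/" + stamp));
    }

    // The periodic reduce job reads every batch at once; the glob
    // matches all dated subfolders.
    static void configureReduceRun(Job reduceJob) throws Exception {
        FileInputFormat.addInputPath(reduceJob,
                new Path("my_mapper_output/*"));
    }
}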
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs, either in HBase or simply HDFS.
map (reduce_function) on (map_outputs); store into HBase.
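Something like this for the first step's filter (untested sketch; the meta family and processed qualifier are placeholder names, not anything your schema has to use):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class UnprocessedScan {
    // Builds a scan that returns only rows not yet marked processed.
    static Scan unprocessedOnly() {
        SingleColumnValueFilter notProcessed = new SingleColumnValueFilter(
                Bytes.toBytes("meta"),        // column family (assumed name)
                Bytes.toBytes("processed"),   // qualifier (assumed name)
                CompareOp.NOT_EQUAL,
                Bytes.toBytes("1"));
        // false = rows lacking the column entirely are also returned,
        // which covers documents inserted before the flag existed.
        notProcessed.setFilterIfMissing(false);
        Scan scan = new Scan();
        scan.setFilter(notProcessed);
        return scan;
    }
}

The resulting Scan can then be handed to TableMapReduceUtil.initTableMapperJob so the map-only job reads just the filtered rows.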
You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date: if you record somewhere the timestamps of successful summary runs, and open the filter on inputs that are dated later than the last successful summary, you'll save some significant scanning time.
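A sketch of that incremental filter, assuming (purely for illustration) that row keys are prefixed with the insertion date, e.g. 20091108-<docid>, so a start-row bound skips everything already summarized:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementalScan {
    // lastRunDate is whatever you recorded after the previous successful
    // summary, in the same format as the row-key prefix (assumed here).
    static Scan sinceLastRun(String lastRunDate) {
        Scan scan = new Scan();
        // HBase rows sort lexicographically by key, so starting the scan
        // at the last run's date skips all previously summarized rows.
        scan.setStartRow(Bytes.toBytes(lastRunDate));
        return scan;
    }
}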
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability