I don't want to sound trite, but this helped me so much, and it's pretty simple. If you are familiar with Python, the following is the simplest possible explanation of MapReduce:
In [1]: from functools import reduce   # in Python 3, reduce lives in functools
In [2]: data = [1, 2, 3, 4, 5, 6]
In [3]: mapped_result = list(map(lambda x: x * 2, data))   # list() because map() returns an iterator in Python 3
In [4]: mapped_result
Out[4]: [2, 4, 6, 8, 10, 12]
In [5]: final_result = reduce(lambda x, y: x + y, mapped_result)
In [6]: final_result
Out[6]: 42
See how each segment of the raw data was processed individually, in this case multiplied by 2 (the map part of MapReduce). Based on mapped_result, we then combined the elements into a single result, 42 (the reduce part of MapReduce).
An important conclusion from this example is that each chunk of processing doesn't depend on any other chunk. For instance, if thread_1 maps [1, 2, 3] and thread_2 maps [4, 5, 6], the combined result of the two threads is still [2, 4, 6, 8, 10, 12], but we have roughly halved the processing time. The same can be said for the reduce operation, and this is the essence of how MapReduce works in parallel computing.
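A minimal sketch of that two-chunk scenario using Python's multiprocessing (process-based rather than thread-based, since CPython threads don't execute bytecode in parallel; the chunking and the double_chunk helper are illustrative, not part of the original answer):

from functools import reduce
from multiprocessing import Pool

def double_chunk(chunk):
    # the "map" step, applied to one chunk by one worker process
    return [x * 2 for x in chunk]

if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5, 6]]                     # one chunk per worker
    with Pool(processes=2) as pool:
        mapped_chunks = pool.map(double_chunk, chunks)  # [[2, 4, 6], [8, 10, 12]]
    mapped_result = [x for chunk in mapped_chunks for x in chunk]
    final_result = reduce(lambda x, y: x + y, mapped_result)
    print(mapped_result, final_result)                  # [2, 4, 6, 8, 10, 12] 42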
Going all the way down to the basics for Map and Reduce.
Map is a function which "transforms" the items in some kind of list into another kind of item and puts them back into the same kind of list.
Suppose I have a list of numbers, [1, 2, 3], and I want to double every number. In this case, the function to "double every number" is the function x = x * 2. Without map, I could write a simple loop, say
A = [1, 2, 3]
foreach (i in 0 .. A.length - 1) A[i] = A[i] * 2
and I'd have A = [2, 4, 6]. But instead of writing loops, if I have a map function I could write
A = [1, 2, 3].Map(x => x * 2)
The x => x * 2 part is a function to be executed against the elements in [1, 2, 3]. What happens is that the program takes each item, executes (x => x * 2) against it by binding x to that item, and produces a list of the results.
So after executing the map function with (x => x * 2) you'd have [2, 4, 6].
Reduce is a function which "collects" the items in a list and performs some computation on all of them, thus reducing them to a single value.
Finding a sum or finding an average are both instances of a reduce function. For example, if you have a list of numbers, say [7, 8, 9], and you want them summed up, you'd write a loop like this
A = [7, 8, 9]
sum = 0
foreach (item in A) sum = sum + item
But, if you have access to a reduce function, you could write it like this
A = [7, 8, 9]
sum = A.reduce( 0, (x, y) => x + y )
Now it's a little confusing why there are 2 arguments (0 and the function with x and y) passed. For a reduce function to be useful, it must be able to take 2 items, compute something, and "reduce" those 2 items to just one single value; that way the program can keep reducing pairs until only a single value is left.
The execution would go as follows:
result = 0
7 : result = result + 7 = 0 + 7 = 7
8 : result = result + 8 = 7 + 8 = 15
9 : result = result + 9 = 15 + 9 = 24
But you don't want to start with zero all the time, so the first argument is there to let you specify a seed value, specifically the value used in the first result = line.
Say you want to sum 2 lists; it might look like this:
A = [7, 8, 9]
B = [1, 2, 3]
sum = 0
sum = A.reduce( sum, (x, y) => x + y )
sum = B.reduce( sum, (x, y) => x + y )
or a version you'd be more likely to find in the real world:
A = [7, 8, 9]
B = [1, 2, 3]
sum_func = (x, y) => x + y
sum = A.reduce( B.reduce( 0, sum_func ), sum_func )
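For comparison, the same thing with Python's built-in functools.reduce, which takes the seed as its third argument rather than its first (a small sketch, not part of the original answer):

from functools import reduce

A = [7, 8, 9]
B = [1, 2, 3]
sum_func = lambda x, y: x + y

# reduce B starting from 0, then use that result as the seed for reducing A
total = reduce(sum_func, A, reduce(sum_func, B, 0))
print(total)   # 30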
It's a good thing to have in database software because, with Map/Reduce support, you can work with the database without needing to know how the data is stored in it; that's what a DB engine is for.
You just need to be able to "tell" the engine what you want by supplying it with either a Map or a Reduce function, and then the DB engine can find its way around the data, apply your function, and come up with the results you want, all without you knowing how it loops over all the records.
There are indexes and keys and joins and views and a lot of other stuff a single database can hold, so by shielding you from how the data is actually stored, your code is made easier to write and maintain.
The same goes for parallel programming: if you only specify what you want to do with the data instead of actually implementing the looping code, then the underlying infrastructure can "parallelize" your function and execute it in a simultaneous parallel loop for you.
1. Take a bunch of data
2. Perform some kind of transformation that converts every datum to another kind of datum
3. Combine those new data into yet simpler data
Step 2 is Map. Step 3 is Reduce.
For example,
1. Get the time between two impulses on a pair of pressure meters on the road
2. Map those times into speeds based upon the distance between the meters
3. Reduce those speeds to an average speed
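A minimal sketch of that example in Python (the distance between the meters and the impulse times are made-up values, used only for illustration):

from functools import reduce

METER_DISTANCE_M = 20.0                      # assumed distance between the two pressure meters

times_s = [1.2, 0.9, 1.5, 1.1]               # step 1: time between impulses, one per car

# step 2 (map): convert each time into a speed
speeds = list(map(lambda t: METER_DISTANCE_M / t, times_s))

# step 3 (reduce): combine the speeds into one average speed
average = reduce(lambda a, b: a + b, speeds) / len(speeds)
print(average)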
The reason MapReduce is split between Map and Reduce is that the different parts can easily be done in parallel. (Especially if Reduce has certain mathematical properties.)
For a complex but good description of MapReduce, see: Google's MapReduce Programming Model -- Revisited (PDF).
MapReduce is a method for processing vast amounts of data in parallel without requiring the developer to write any code other than the mapper and reducer functions.
The map function takes data in and churns out a result, which is held in a barrier. This function can run in parallel with a large number of instances of the same map task. The dataset can then be reduced to a scalar value.
So if you think of it like a SQL statement
SELECT SUM(salary)
FROM employees
WHERE salary > 1000
GROUP BY deptname
We can use map to get our subset of employees with salary > 1000; map emits those rows to the barrier, bucketed into groups.
Reduce will then sum each of those groups, giving you your result set.
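A rough single-process Python sketch of that query in map/reduce terms (the employee records are made up, and the grouping/shuffle step is done by hand here; a real framework would do it for you):

from collections import defaultdict

employees = [
    {"deptname": "sales", "salary": 1200},
    {"deptname": "sales", "salary": 800},
    {"deptname": "dev",   "salary": 2000},
    {"deptname": "dev",   "salary": 1500},
]

# map: filter on salary > 1000 and emit (deptname, salary) pairs
emitted = [(e["deptname"], e["salary"]) for e in employees if e["salary"] > 1000]

# shuffle: group the emitted pairs by key (the "barrier"/buckets mentioned above)
groups = defaultdict(list)
for dept, salary in emitted:
    groups[dept].append(salary)

# reduce: sum each group, giving one row per deptname
result = {dept: sum(salaries) for dept, salaries in groups.items()}
print(result)   # {'sales': 1200, 'dev': 3500}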
I just plucked this from my university study notes on the Google paper.
Let's take the example from the Google paper. The goal of MapReduce is to be able to efficiently use a large number of processing units working in parallel for certain kinds of algorithms. The example is the following: you want to extract all the words and their counts from a set of documents.
Typical implementation:

for each document
    for each word in the document
        get the counter associated with the word for the document
        increment that counter
    end for
end for

MapReduce implementation:

Map phase (input: document key, document)
    for each word in the document
        emit an event with the word as the key and the value "1"
    end for

Reduce phase (input: key (a word), an iterator going through the emitted values)
    for each value in the iterator
        sum up the value in a counter
    end for
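A minimal single-machine Python sketch of that word-count pipeline (map phase, a hand-rolled shuffle to group values by key, then the reduce phase; the documents are made-up examples, and a real framework would distribute the splits across workers):

from collections import defaultdict

documents = {"doc1": "the cat sat", "doc2": "the cat ran"}

# Map phase: for each (document key, document), emit (word, 1) pairs
emitted = []
for doc_key, text in documents.items():
    for word in text.split():
        emitted.append((word, 1))

# Shuffle: group the emitted values by word
grouped = defaultdict(list)
for word, count in emitted:
    grouped[word].append(count)

# Reduce phase: sum the values behind each key
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}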
Around that, you'll have a master program which partitions the set of documents into "splits" that are handled in parallel during the Map phase. The emitted values are written by each worker into a buffer specific to that worker. The master program then delegates other workers to perform the Reduce phase as soon as it is notified that a buffer is ready to be handled.
Every worker's output (whether from a Map or a Reduce worker) is in fact a file stored on the distributed file system (GFS for Google), or in the distributed database in the case of CouchDB.
A really easy, quick and "for dummies" introduction to MapReduce is available at: http://www.marcolotz.com/?p=67
Posting some of its content:
First of all, why was MapReduce originally created?
Basically, Google needed a solution for making large computation jobs easily parallelizable, allowing data to be distributed across a number of machines connected through a network. Aside from that, it had to handle machine failures in a transparent way and manage load-balancing issues.
What are MapReduce's true strengths?
One may say that the MapReduce magic is based on the application of the Map and Reduce functions. I must confess, mate, that I strongly disagree. The main feature that made MapReduce so popular is its capability for automatic parallelization and distribution, combined with a simple interface. These factors, together with transparent handling of most failures, made this framework so popular.
A little more depth on the paper:
MapReduce was originally mentioned in a Google paper (Dean & Ghemawat, 2004 – link here) as a solution for making computations on Big Data using a parallel approach and commodity-computer clusters. In contrast to Hadoop, which is written in Java, Google's framework is written in C++. The document describes how a parallel framework would behave, using the Map and Reduce functions from functional programming, over large data sets.
In this solution there would be two main steps – called Map and Reduce – with an optional step between the first and the second – called Combine. The Map step would run first, do computations on the input key-value pair and generate a new output key-value pair. One must keep in mind that the format of the input key-value pairs does not necessarily need to match the format of the output pairs. The Reduce step would assemble all values of the same key, performing other computations over them. As a result, this last step would output key-value pairs. One of the most trivial applications of MapReduce is to implement word counts.
The pseudo-code for this application is given below:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
As one can notice, the map reads all the words in a record (in this case a record can be a line) and emits each word as a key with the number 1 as its value. Later on, the reduce will group all values of the same key. Let's give an example: imagine that the word 'house' appears three times in the record. The input of the reducer would be [house,[1,1,1]]. In the reducer, it will sum all the values for the key house and give as output the following key-value pair: [house,[3]].
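In Python terms, the reducer for that key would see something like the following (a toy illustration, not framework code):

key, values = "house", [1, 1, 1]   # what the framework hands to the reducer after grouping
result = 0
for v in values:                   # mirrors the reduce pseudo-code above
    result += int(v)
print(key, result)                 # house 3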
(The original post includes a diagram of how this looks in a MapReduce framework.)
A few other classical examples of MapReduce applications are:
• Count of URL access frequency
• Reverse Web-link Graph
• Distributed Grep
• Term Vector per host
In order to avoid too much network traffic, the paper describes how the framework should try to maintain data locality. This means that it should always try to make sure that a machine running Map jobs has the data in its memory/local storage, avoiding fetching it over the network. Aiming to reduce a mapper's network throughput, the optional combiner step described before is used. The Combiner performs computations on the output of the mappers on a given machine before sending it to the Reducers – which may be on another machine.
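A rough sketch of what such a combiner does with one mapper's local output before anything is sent over the network (using Python's collections.Counter for the local pre-aggregation; the emitted pairs are made up):

from collections import Counter

# raw (word, 1) pairs emitted by one mapper on one machine
local_emits = [("house", 1), ("car", 1), ("house", 1), ("house", 1)]

# combiner: pre-aggregate locally, so only one pair per word leaves the machine
combined = Counter()
for word, count in local_emits:
    combined[word] += count

print(list(combined.items()))   # e.g. [('house', 3), ('car', 1)] -- sent on to the reducers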
The document also describes how the elements of the framework should behave in case of faults. These elements are called, in the paper, the worker and the master. They are divided into more specific elements in open-source implementations. Since Google only described the approach in the paper and did not release its proprietary software, many open-source frameworks were created in order to implement the model. Examples include Hadoop and the limited MapReduce feature in MongoDB.
The run-time should take care of the details that non-expert programmers shouldn't have to, like partitioning the input data, scheduling the program's execution across the large set of machines, handling machine failures (in a transparent way, of course) and managing the inter-machine communication. An experienced user may tune these parameters, such as how the input data is partitioned between workers.
Key Concepts:
• Fault Tolerance: It must tolerate machine failure gracefully. To do this, the master pings the workers periodically. If the master does not receive a response from a given worker within a definite time lapse, the master marks the work on that worker as failed. In this case, all map tasks completed by the faulty worker are thrown away and given to another available worker. The same happens if the worker was still processing a map or a reduce task. Note that if the worker had already completed its reduce part, all of that computation was already finished by the time it failed and does not need to be redone. As a primary point of failure, if the master fails, the whole job fails. For this reason, one may define periodic checkpoints for the master, in order to save its data structures. All computations that happen between the last checkpoint and the master failure are lost.
• Locality: In order to avoid network traffic, the framework tries to make sure that all the input data is locally available to the machines that are going to perform computations on it. In the original description, it uses the Google File System (GFS) with a replication factor of 3 and block sizes of 64 MB. This means that the same 64 MB block (blocks compose a file in the file system) will have identical copies on three different machines. The master knows where the blocks are and tries to schedule map jobs on those machines. If that fails, the master tries to allocate a machine near a replica of the task's input data (i.e. a worker machine in the same rack as the data machine).
• Task Granularity: Assuming that the map phase is divided into M pieces and the Reduce phase is divided into R pieces, the ideal is for M and R to be a lot larger than the number of worker machines. This is due to the fact that a worker performing many different tasks improves dynamic load balancing. Aside from that, it increases the recovery speed in the case of a worker failure (since the many map tasks it has completed can be spread out across all the other machines).
• Backup Tasks: Sometimes a Map or Reduce worker may behave a lot more slowly than the others in the cluster. This can hold up the total processing time and make it equal to the processing time of that single slow machine. The original paper describes an alternative called Backup Tasks: when a MapReduce operation is close to completion, the master schedules backup copies of the in-progress tasks. Thus, the MapReduce operation completes when either the primary or the backup finishes.
• Counters: Sometimes one may want to count the occurrences of events. For this reason, counters were created. The counter values in each worker are periodically propagated to the master. The master then aggregates (yep, it looks like Pregel aggregators came from this place) the counter values of successful map and reduce tasks and returns them to the user code when the MapReduce operation is complete. There is also a current counter value available in the master status, so a human watching the process can keep track of how it is behaving.
Well, I guess with all the concepts above, Hadoop will be a piece of cake for you. If you have any questions about the original MapReduce article or anything related, please let me know.
MAP and REDUCE are old Lisp functions, from a time when man killed the last dinosaurs.
Imagine you have a list of cities with information about the name, the number of people living there and the size of the city. Now you may want to find the city with the highest population density.
First, you use MAP to build a list of city names and population densities. Using REDUCE you can then find the city with the largest population density. Combining both parts, and then pulling the two steps out into named functions, gives you the MAP REDUCE code: it calls MAP and REDUCE (evaluation is inside out), so it is called map reduce.
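The original answer illustrated this with Lisp listings that did not survive here; the sketch below is a rough Python equivalent of the same idea, with made-up city data:

from functools import reduce

cities = [
    # (name, population, size in km^2) -- illustrative values only
    ("Smallville",  60_000,  30.0),
    ("Metropolis", 900_000, 200.0),
    ("Gotham",     800_000, 100.0),
]

# MAP: transform each city into (name, population density)
densities = list(map(lambda c: (c[0], c[1] / c[2]), cities))

# REDUCE: of any two pairs, keep the one with the higher density
densest = reduce(lambda a, b: a if a[1] >= b[1] else b, densities)
print(densest)   # ('Gotham', 8000.0)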