Sorting word counts with Hadoop MapReduce



I'm very new to MapReduce, and I have completed the Hadoop word-count example.

In that example it produces an unsorted file (with key-value pairs) of word counts. So is it possible to sort it by the number of word occurrences by chaining another MapReduce job onto the earlier one?


Comments (4)

始终不够 2024-09-03 08:46:39


In the simple word-count MapReduce program, the output we get is sorted by word. A sample output could be:

Apple 1
Boy 30
Cat 2
Frog 20
Zebra 1

If you want the output to be sorted by the number of occurrences of the words, i.e. in the format below:

1 Apple
1 Zebra
2 Cat
20 Frog
30 Boy

You can create another MR program using the mapper and reducer below, where the input will be the output obtained from the simple word-count program.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Parses each "word<TAB>count" line of the word-count output and emits (count, word),
// so the framework sorts the records by count during the shuffle.
class Map1 extends MapReduceBase implements Mapper<Object, Text, IntWritable, Text>
{
    public void map(Object key, Text value, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer stringTokenizer = new StringTokenizer(line);

        int number = 999;
        String word = "empty";

        // First token: the word.
        if (stringTokenizer.hasMoreTokens())
        {
            word = stringTokenizer.nextToken().trim();
        }

        // Second token: its count.
        if (stringTokenizer.hasMoreTokens())
        {
            number = Integer.parseInt(stringTokenizer.nextToken().trim());
        }

        collector.collect(new IntWritable(number), new Text(word));
    }
}

// Pass-through reducer: writes out each (count, word) pair in sorted key order.
class Reduce1 extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text>
{
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> collector, Reporter reporter) throws IOException
    {
        while (values.hasNext())
        {
            collector.collect(key, values.next());
        }
    }
}
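
For context, a driver for this second job could look roughly like the sketch below. This part is an assumption on my side, not from the original answer: the driver class name, the input/output paths taken from args, and the single-reducer setting are all illustrative; it reuses the old org.apache.hadoop.mapred API from the code above.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver that wires Map1 and Reduce1 together.
public class SortByCountDriver {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SortByCountDriver.class);
        conf.setJobName("sort-by-count");

        conf.setMapperClass(Map1.class);
        conf.setReducerClass(Reduce1.class);

        // Map and reduce both emit (IntWritable count, Text word).
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);

        // Input: the text output of the word-count job. Output: the sorted listing.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // A single reducer produces one globally sorted file.
        conf.setNumReduceTasks(1);

        JobClient.runJob(conf);
    }
}

Note that the shuffle sorts IntWritable keys in ascending order, so the most frequent words end up last in the output.
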
源来凯始玺欢你 2024-09-03 08:46:39


The output from the Hadoop MapReduce wordcount example is sorted by the key. So the output should be in alphabetical order.

With Hadoop you can create your own key objects that implement the WritableComparable interface allowing you to override the compareTo method. This allows you to control the sort order.

To create an output that is sorted by the number of occurrences, you would probably have to add another MapReduce job to process the output from the first, as you have said. This second job would be very simple, maybe not even requiring a reduce phase. You would just need to implement your own Writable key object to wrap the word and its frequency. A custom writable looks something like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public int compareTo(MyWritableComparable w) {
        // Order by the counter field; Hadoop uses this to sort keys during the shuffle.
        int thisValue = this.counter;
        int thatValue = w.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}

I grabbed this example from here.

You should probably override hashCode, equals and toString as well.
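
To make the wrapping idea concrete, a key that carries both the word and its count could look like the rough sketch below. The class name, field names, and tab-separated toString format are illustrative assumptions, not taken from the linked example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Illustrative composite key: wraps a word and its count and sorts by count.
public class WordCountKey implements WritableComparable<WordCountKey> {
    private String word;
    private int count;

    public WordCountKey() { }                     // Hadoop needs a no-arg constructor

    public WordCountKey(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    public int compareTo(WordCountKey other) {
        // Primary sort on the count; break ties on the word so the ordering is total.
        int byCount = Integer.compare(this.count, other.count);
        return byCount != 0 ? byCount : this.word.compareTo(other.word);
    }

    @Override
    public String toString() {
        return count + "\t" + word;
    }
}

As noted above, a real implementation would also override hashCode and equals (over both fields) so the key behaves consistently with Hadoop's default hash partitioner.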

ぃ双果 2024-09-03 08:46:39


In Hadoop, sorting is done between the Map and the Reduce phases. One approach to sort by word occurrence would be to use a custom grouping comparator that doesn't group anything; then every call to reduce gets just one key and one value.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class Program {
   public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(Program.class);

      // Assumes the first job wrote its (Text, IntWritable) counts as a SequenceFile.
      conf.setInputFormat(SequenceFileInputFormat.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      conf.setOutputKeyClass(IntWritable.class);
      conf.setOutputValueClass(Text.class);
      conf.setMapperClass(Map.class);
      conf.setReducerClass(IdentityReducer.class);
      // Grouping comparator that never groups: each reduce call sees one key and one value.
      conf.setOutputValueGroupingComparator(GroupComparator.class);
      conf.setNumReduceTasks(1);   // one reducer gives a single, totally ordered output file
      JobClient.runJob(conf);
   }
}

public class Map extends MapReduceBase implements Mapper<Text, IntWritable, IntWritable, Text> {

   // Swap (word, count) to (count, word) so the shuffle sorts the records by count.
   public void map(Text key, IntWritable value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
       output.collect(value, key);
   }
}

public class GroupComparator extends WritableComparator {
    protected GroupComparator() {
        super(IntWritable.class, true);
    }

    // Never treat two keys as equal, so values are never grouped together.
    public int compare(WritableComparable w1, WritableComparable w2) {
        return -1;
    }
}
暗喜 2024-09-03 08:46:39


As you have said, one possibility is to write two jobs to do this.
First job:
Simple wordcount example

Second job:
Does the sorting part.

The pseudo code could be:

Note: The output file generated by the first job will be the input for the second job.

    Mapper2(String _key, IntWritable _value) {
        // Just reverse the positions of _key and _value. This is useful because the
        // reducer will then receive the records sorted and shuffled by count.
        emit(_value, _key);
    }

    Reduce2(IntWritable valueofMapper2, Iterable<String> keysofMapper2) {
        // At the reducer side, all the keys (words) that have the same count arrive together.
        for each K in keysofMapper2 {
            emit(K, valueofMapper2);   // this comes out in ascending count order
        }
    }
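
For reference, one way this pseudocode could translate into the newer org.apache.hadoop.mapreduce API is sketched below. The class names follow the pseudocode; the assumption that the first job's output is read back as (Text, IntWritable) pairs, e.g. from a SequenceFile, is mine rather than the answer's:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Second-job mapper: swap (word, count) into (count, word) so the shuffle sorts by count.
class Mapper2 extends Mapper<Text, IntWritable, IntWritable, Text> {
    @Override
    protected void map(Text word, IntWritable count, Context context)
            throws IOException, InterruptedException {
        context.write(count, word);
    }
}

// Second-job reducer: counts arrive in ascending order; emit (word, count) back out.
class Reduce2 extends Reducer<IntWritable, Text, Text, IntWritable> {
    @Override
    protected void reduce(IntWritable count, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
        for (Text word : words) {
            context.write(word, count);
        }
    }
}

A driver would then set Mapper2 and Reduce2 on the Job and point its input at the first job's output directory.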

You can also sort in descending order; writing a separate comparator class will do the trick.
Include the comparator in the job as:

job.setSortComparatorClass(DescendingIntWritableComparator.class);

This comparator sorts the keys (the counts) in descending order before they reach the reducer, so in the reducer you just emit the values.
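
A minimal sketch of such a comparator for IntWritable keys could look like this (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative descending-order comparator for IntWritable keys.
public class DescendingIntWritableComparator extends WritableComparator {
    protected DescendingIntWritableComparator() {
        super(IntWritable.class, true);   // true => instantiate keys so compare() gets real objects
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Invert the natural (ascending) order of IntWritable.
        return -((IntWritable) a).compareTo((IntWritable) b);
    }
}

Registered via job.setSortComparatorClass(DescendingIntWritableComparator.class), it makes the counts reach the reducer from largest to smallest.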
