MapReduce - counting the number of documents a word appears in
I am new to MapReduce and trying to extend the word count program. I want to count how many documents a word appears in.
Example: if I have 3 documents and the word "try" appears 3 times in document 1 and 5 times in document 3, I want the final count to be 2.
I am not really sure how to do this. I tried using a WritableComparable class as the key in my mapper, but I got errors when I replaced the key with that class, so I abandoned it. I am currently trying to use a Text variable for the key, with the value "word + document name".
Here is what I have so far:
CODE
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class wcount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private Text fileName = new Text();
        private String tokens = "[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"”“]";

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String cleanValue = value.toString().toLowerCase().replaceAll(tokens, " ");
            String filePathString = ((FileSplit) reporter.getInputSplit()).getPath().getName().toString();
            fileName.set(new Text(filePathString));
            String line = cleanValue.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                Text k = new Text(word + " " + fileName);
                output.collect(k, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String[] new_key = key.toString().split(" ");
            Text word = new Text();
            Text FileName = new Text();
            word.set(new_key[0]);
            //FileName.set(new_key[1]); //error here
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(FileName, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(wcount.class);
        conf.setJobName("wcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
In my reducer I am trying to separate the key into 2 strings, but "FileName.set(new_key[1]);" is giving me an out of bounds exception.
I want to know if it is possible to do this with 1 run of MapReduce or whether I need a second one. An example would be much appreciated.
Comments (2)
Validate your inputs
You might also want to consider using LongWritable for large counts, or a Text output from a BigInteger value
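For example, a minimal guard for the key split (plain Java, with the Hadoop plumbing omitted; the "word filename" key format and space delimiter are taken from the question's code) might look like this:

```java
import java.util.Arrays;

public class KeyGuard {
    // Splits a composite "word filename" key safely. Validating the length
    // before indexing turns a silent out-of-bounds crash into a clear error
    // message that shows which key was malformed.
    static String[] splitKey(String compositeKey) {
        // limit = 2: only split on the first space, so any further spaces
        // stay inside the filename part
        String[] parts = compositeKey.split(" ", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "Malformed key, expected \"word filename\": " + compositeKey);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKey("try doc1.txt"))); // [try, doc1.txt]
    }
}
```

One likely source of malformed keys in the question's code is the combiner: `conf.setCombinerClass(Reduce.class)` runs the same reduce logic on the map side, and that reduce emits the unset `FileName` as the key, so the real reducer then receives keys that no longer have the "word filename" form.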
我正在为遇到相同问题的任何人发布代码。
I am posting the code for anyone having the same problem.
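One way to do it in a single MapReduce run, sketched here as plain Java without the Hadoop plumbing (a hypothetical outline, not the poster's actual code): keep the mapper emitting `(word, filename)` pairs, and have the reducer count the *distinct* filenames it receives for each word rather than summing occurrence counts.

```java
import java.util.*;

public class DocFrequency {
    // Simulates the reduce step for one word: the values are the filenames
    // emitted by the mappers, one per occurrence of the word. The document
    // frequency is the number of distinct filenames, not the number of values.
    static int documentFrequency(List<String> fileNamesForWord) {
        return new HashSet<>(fileNamesForWord).size();
    }

    public static void main(String[] args) {
        // The question's example: "try" appears 3 times in document 1 and
        // 5 times in document 3, so the final count should be 2.
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 3; i++) values.add("doc1.txt");
        for (int i = 0; i < 5; i++) values.add("doc3.txt");
        System.out.println(documentFrequency(values)); // prints 2
    }
}
```

In the Hadoop version this means the mapper output types become `<Text, Text>` (word as key, filename as value) and the reducer builds a `HashSet` over the values; note this holds every filename for a word in memory at once, so for very large corpora the two-job approach (first count per word+document, then count documents per word) scales better.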