MapReduce - counting the number of documents a word appears in
I am new to MapReduce and trying to extend the word count program. I want to count how many documents a word appears in.
Example: if I have 3 documents and the word "try" appears 3 times in document 1 and 5 times in document 3, I want the final count to be 2.
I am not really sure how to do this. I tried using a WritableComparable class as the key in my mapper, but I got errors when I replaced the key with that class, so I abandoned it. I am currently trying to use a Text variable for the key, with the value "word + document name".
Here is what I have so far:
CODE
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class wcount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private Text fileName = new Text();
        private String tokens = "[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"”“]";

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String cleanValue = value.toString().toLowerCase().replaceAll(tokens, " ");
            String filePathString = ((FileSplit) reporter.getInputSplit()).getPath().getName().toString();
            fileName.set(new Text(filePathString));
            String line = cleanValue.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                Text k = new Text(word + " " + fileName);
                output.collect(k, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String[] new_key = key.toString().split(" ");
            Text word = new Text();
            Text FileName = new Text();
            word.set(new_key[0]);
            //FileName.set(new_key[1]); //error here
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(FileName, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(wcount.class);
        conf.setJobName("wcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
In my reducer I am trying to separate the key into 2 strings, but "FileName.set(new_key[1]);" is giving me an out of bounds exception.
I want to know if it is possible to do this with 1 run of MapReduce or whether I need a second one. An example would be much appreciated.
Comments (2)
Validate your inputs
You might also want to consider using LongWritable for large counts, or a Text output from a BigInteger value
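For example, a minimal guard for the key split (plain Java, with the Hadoop plumbing omitted; the "word filename" key format and space delimiter are taken from the question's code) might look like this:

```java
import java.util.Arrays;

public class KeyGuard {
    // Splits a composite "word filename" key safely. Validating the length
    // before indexing turns a silent out-of-bounds crash into a clear error
    // message that shows which key was malformed.
    static String[] splitKey(String compositeKey) {
        // limit = 2: only split on the first space, so any further spaces
        // stay inside the filename part
        String[] parts = compositeKey.split(" ", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "Malformed key, expected \"word filename\": " + compositeKey);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKey("try doc1.txt"))); // [try, doc1.txt]
    }
}
```

One likely source of malformed keys in the question's code is the combiner: `conf.setCombinerClass(Reduce.class)` runs the same reduce logic on the map side, and that reduce emits the unset `FileName` as the key, so the real reducer then receives keys that no longer have the "word filename" form.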
我正在为遇到相同问题的任何人发布代码。
I am posting the code for anyone having the same problem.
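One way to do it in a single MapReduce run, sketched here as plain Java without the Hadoop plumbing (a hypothetical outline, not the poster's actual code): keep the mapper emitting `(word, filename)` pairs, and have the reducer count the *distinct* filenames it receives for each word rather than summing occurrence counts.

```java
import java.util.*;

public class DocFrequency {
    // Simulates the reduce step for one word: the values are the filenames
    // emitted by the mappers, one per occurrence of the word. The document
    // frequency is the number of distinct filenames, not the number of values.
    static int documentFrequency(List<String> fileNamesForWord) {
        return new HashSet<>(fileNamesForWord).size();
    }

    public static void main(String[] args) {
        // The question's example: "try" appears 3 times in document 1 and
        // 5 times in document 3, so the final count should be 2.
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 3; i++) values.add("doc1.txt");
        for (int i = 0; i < 5; i++) values.add("doc3.txt");
        System.out.println(documentFrequency(values)); // prints 2
    }
}
```

In the Hadoop version this means the mapper output types become `<Text, Text>` (word as key, filename as value) and the reducer builds a `HashSet` over the values; note this holds every filename for a word in memory at once, so for very large corpora the two-job approach (first count per word+document, then count documents per word) scales better.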