How do I use MultipleTextOutputFormat with the new Hadoop API?

Posted 2024-11-14 15:19:09

I would like to write multiple output files.
How do I do this using Job instead of JobConf?

2 Answers

柠檬色的秋千 2024-11-21 15:19:09

An easy way to create key-based output file names:

 Input data:

  //key        //value
 cupertino   apple
 sunnyvale   banana
 cupertino   pear

MultipleTextOutputFormat class

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record into a subdirectory named after its key.
static class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;  // e.g. cupertino/part-00000
    }
}

Job configuration:

 job.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);
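
Note that MultipleTextOutputFormat and setOutputFormat belong to the old org.apache.hadoop.mapred API, so the driver here has to be a JobConf rather than a new-API Job. For context, a minimal driver sketch under that assumption (MyDriver, MyMapper, MyReducer, and the paths are placeholders, not from the original answer):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver wiring; only the setOutputFormat call comes from the answer.
JobConf job = new JobConf(MyDriver.class);
job.setJobName("key-based-output");
job.setMapperClass(MyMapper.class);    // placeholder mapper emitting <Text, Text>
job.setReducerClass(MyReducer.class);  // placeholder reducer emitting <Text, Text>
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/output"));
JobClient.runJob(job);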

Run this code and you’ll see the following files in HDFS, where /output is the job output directory:

 $ hadoop fs -ls /output
 /output/cupertino/part-00000
 /output/sunnyvale/part-00000

Hope it helps.

再见回来 2024-11-21 15:19:09

The docs say to use org.apache.hadoop.mapreduce.lib.output.MultipleOutputs instead.

Below is a snippet of code that uses MultipleOutputs. Unfortunately I didn't write it and haven't spent much time with it, so I don't know exactly why things are where they are. I'm sharing it in the hope it helps. :)

Job Setup

job.setJobName("Job Name");
job.setJarByClass(ETLManager.class);
job.setMapOutputKeyClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(MyThing.class);
job.setMapperClass(MyThingMapper.class);
job.setReducerClass(MyThingReducer.class);
// Register a named output that the reducer can write to through MultipleOutputs.
MultipleOutputs.addNamedOutput(job, Constants.MyThing_NAMED_OUTPUT, TextOutputFormat.class, NullWritable.class, Text.class);
job.setInputFormatClass(MyInputFormat.class);
FileInputFormat.addInputPath(job, new Path(conf.get("input")));
FileOutputFormat.setOutputPath(job, new Path(String.format("%s/%s", conf.get("output"), Constants.MyThing_NAMED_OUTPUT)));
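
One thing the snippet doesn't address: when every record goes through MultipleOutputs and nothing is written to the default output, the job still creates empty part-r-nnnnn files in the output directory. If that matters, the new API's LazyOutputFormat defers creating the default output until something is actually written to it. A small, optional addition to the setup above (the original author didn't use it):

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Only create default output files on first write, avoiding empty part files.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);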

Reducer Setup

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MyThingReducer extends
        Reducer<Text, MyThing, NullWritable, NullWritable> {
    // Raw type, as in the original: the reducer's declared output types differ
    // from what the named output writes, so it isn't parameterized here.
    private MultipleOutputs m_multipleOutputs;

    @Override
    public void setup(Context context) {
        m_multipleOutputs = new MultipleOutputs(context);
    }

    @Override
    public void cleanup(Context context) throws IOException,
            InterruptedException {
        // Close MultipleOutputs, or records written through it may be lost.
        if (m_multipleOutputs != null) {
            m_multipleOutputs.close();
        }
    }

    @Override
    public void reduce(Text key, Iterable<MyThing> values, Context context)
            throws IOException, InterruptedException {
        for (MyThing myThing : values) {
            // EMPTY_KEY, generateData and generateFileName are the author's
            // own helpers; they are not shown in the post.
            m_multipleOutputs.write(Constants.MyThing_NAMED_OUTPUT, EMPTY_KEY,
                    generateData(context, myThing), generateFileName(context, myThing));
            context.progress();
        }
    }
}
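
EMPTY_KEY, generateData, and generateFileName are never shown in the post. Purely as a hypothetical sketch of what the elided helpers might look like (getCategory is an invented accessor, not from the original):

// Hypothetical stand-ins for the helpers elided in the original post.
private static final NullWritable EMPTY_KEY = NullWritable.get();

// Serialize the record into the Text value that gets written out.
private Text generateData(Context context, MyThing myThing) {
    return new Text(myThing.toString());
}

// Choose a base path under the job output directory; MultipleOutputs appends
// the part suffix, so "cupertino/part" becomes "cupertino/part-r-00000".
private String generateFileName(Context context, MyThing myThing) {
    return myThing.getCategory() + "/part";
}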

EDIT: Added link to MultipleOutputs.
