How to convert a .txt file to Hadoop's sequence file format

Published 2024-10-25 06:00:35

To effectively utilise map-reduce jobs in Hadoop, I need the data to be stored in Hadoop's sequence file format. However, the data is currently only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?

Comments (7)

爱情眠于流年 2024-11-01 06:00:35

So the simplest answer is just an "identity" job that has SequenceFile output.

It looks like this in Java:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ConvertTextToSequenceFile {

        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {

            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJobName("Convert Text");
            job.setJarByClass(ConvertTextToSequenceFile.class);

            // The base Mapper and Reducer classes are identities:
            // they pass every record through unchanged.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);

            // increase if you need sorting or a special number of files
            job.setNumReduceTasks(0);

            // TextInputFormat produces (byte offset, line) pairs,
            // so these are the types the sequence file will hold.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setInputFormatClass(TextInputFormat.class);

            TextInputFormat.addInputPath(job, new Path("/lol"));
            SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));

            // submit and wait for completion
            job.waitForCompletion(true);
        }
    }
留蓝 2024-11-01 06:00:35
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // White, Tom (2012). Hadoop: The Definitive Guide. O'Reilly Media.
    // (Kindle Locations 5375-5384.)

    public class SequenceFileWriteDemo {

        private static final String[] DATA = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"
        };

        public static void main(String[] args) throws IOException {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path path = new Path(uri);
            IntWritable key = new IntWritable();
            Text value = new Text();
            SequenceFile.Writer writer = null;
            try {
                writer = SequenceFile.createWriter(fs, conf, path,
                        key.getClass(), value.getClass());
                for (int i = 0; i < 100; i++) {
                    key.set(100 - i);
                    value.set(DATA[i % DATA.length]);
                    // "%s", not "% s" -- see the format-specifier comment below.
                    System.out.printf("[%s]\t%s\t%s\n",
                            writer.getLength(), key, value);
                    writer.append(key, value);
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }
橘亓 2024-11-01 06:00:35

It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat, which creates one record for each line. In your mapper you can then parse that line and use it however you choose (see the sketch below).

If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.
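
For the one-line-per-record case, here is a minimal sketch of such a mapper; the class name and the tab-separated (name, count) record layout are hypothetical, purely for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical example: parses one tab-separated "name<TAB>count"
    // record per line and emits (name, count) pairs.
    public class LineParsingMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]),
                        new IntWritable(Integer.parseInt(fields[1])));
            }
        }
    }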

You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.) and do an insert-select from the intermediate table. You can also set options for compression, e.g.,

    set hive.exec.compress.output = true;
    set io.seqfile.compression.type = BLOCK;
    set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

    create table... stored as sequencefile;

    insert overwrite table ... select * from ...;

The MR framework will then take care of the heavy lifting for you, saving you the trouble of having to write Java code.
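
Putting those pieces together, an end-to-end version might look roughly like this; the table names, the single column, and the input path are all hypothetical:

    -- Hypothetical staging table holding the raw text lines.
    CREATE TABLE staging_txt (line STRING)
    STORED AS TEXTFILE;

    -- Load the raw file (already on HDFS) into the staging table.
    LOAD DATA INPATH '/user/me/input.txt' INTO TABLE staging_txt;

    SET hive.exec.compress.output = true;
    SET io.seqfile.compression.type = BLOCK;
    SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

    -- Target table stored as a (compressed) sequence file.
    CREATE TABLE converted_seq (line STRING)
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE converted_seq
    SELECT * FROM staging_txt;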

愛放△進行李 2024-11-01 06:00:35

Be watchful with the format specifier:

For example (note the space between % and s), System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value); will give us java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =

Instead, we should use:

    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
半仙 2024-11-01 06:00:35

If your data is not on HDFS, you need to upload it to HDFS first. Two options:

i) hdfs dfs -put your .txt file onto HDFS; once it is there, you can convert it to a seq file.

ii) Take the text file as input on your HDFS client box and convert it to a SeqFile using the SequenceFile APIs, by creating a SequenceFile.Writer and appending (key, value) pairs to it.

If you don't care about the key, you can use the line number as the key and the complete line of text as the value, as sketched below.
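
Here is a minimal sketch of option ii), assuming a local .txt input and an HDFS output URI passed on the command line (both paths hypothetical), using the same deprecated-but-common SequenceFile.createWriter(FileSystem, ...) form as the answer above:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class TxtToSequenceFile {

        public static void main(String[] args) throws IOException {
            // args[0]: local .txt input, args[1]: HDFS output URI (hypothetical)
            String outUri = args[1];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(outUri), conf);

            LongWritable key = new LongWritable();
            Text value = new Text();

            SequenceFile.Writer writer = null;
            BufferedReader reader = null;
            try {
                writer = SequenceFile.createWriter(fs, conf, new Path(outUri),
                        key.getClass(), value.getClass());
                reader = new BufferedReader(new FileReader(args[0]));
                String line;
                long lineNo = 0;
                while ((line = reader.readLine()) != null) {
                    key.set(lineNo++);   // line number as key
                    value.set(line);     // complete line of text as value
                    writer.append(key, value);
                }
            } finally {
                IOUtils.closeStream(writer);
                IOUtils.closeStream(reader);
            }
        }
    }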

梦明 2024-11-01 06:00:35

If you have Mahout installed, it has something called seqdirectory, which can do it.
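
A typical invocation, assuming the mahout launcher script is on your PATH (both paths hypothetical), looks roughly like:

    mahout seqdirectory -i /user/me/text-input -o /user/me/seq-output -c UTF-8

seqdirectory writes one (filename, file contents) pair per input file into sequence files, which is handy when each .txt file should become a single record.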
