How to convert a .txt file to Hadoop's sequence file format
To effectively utilise map-reduce jobs in Hadoop, I need the data to be stored in Hadoop's sequence file format. However, currently the data is only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?
So the simplest answer is just an "identity" job that has a SequenceFile output.
It looks like this in Java:
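The answer's original snippet was not preserved on this page; a minimal sketch of such an identity job, assuming the built-in `Mapper` as an identity mapper and illustrative input/output paths, might look like:

```java
// Sketch of an "identity" map-only job: reads plain text with
// TextInputFormat and writes the unchanged (offset, line) pairs
// to a SequenceFile. Paths and the class name are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TxtToSequenceFileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "txt-to-seqfile");
        job.setJarByClass(TxtToSequenceFileJob.class);

        // The base Mapper class passes every record through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0); // map-only job, no shuffle needed

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it with the input directory and a non-existent output directory as arguments; the output directory will contain the SequenceFile parts.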
It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat which creates one record for each line. In your mapper you can parse that line and use it whichever way you choose.
If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.
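For the one-line-per-record case, a sketch of such a mapper (the tab-separated "id, payload" layout and the class name are assumptions for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: assumes each input line is "id<TAB>payload"
// and re-emits the record keyed by the id field.
public class LineParsingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split into at most two fields so tabs in the payload survive.
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2) { // skip malformed lines
            outKey.set(fields[0]);
            outValue.set(fields[1]);
            context.write(outKey, outValue);
        }
    }
}
```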
You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.) and INSERT ... SELECT from the intermediate table. You can also set options such as compression.
The MR framework will then take care of the heavy lifting for you, saving you the trouble of having to write Java code.
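A minimal HiveQL sketch of this approach (the table names, columns, delimiter, path, and compression setting are all illustrative assumptions):

```sql
-- Intermediate table holding the raw delimited text (names are hypothetical).
CREATE TABLE staging_txt (id STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hive/input/data.txt' INTO TABLE staging_txt;

-- Final table stored as a sequence file, with compressed output enabled.
SET hive.exec.compress.output=true;
CREATE TABLE data_seq (id STRING, payload STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE data_seq
SELECT id, payload FROM staging_txt;
```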
Be watchful with the %s format specifier. For example (note the space between % and s), System.out.printf("[% s]\t%s\t%s\n", writer.getLength(), key, value); will give us java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =
Instead, we should use: System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
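A self-contained demonstration of the pitfall (using String.format so it runs without a SequenceFile.Writer; the values are placeholders):

```java
// "% s" is parsed as the space flag plus the s conversion, and the
// space flag is not valid for s, so the Formatter throws.
public class FormatFlagDemo {
    public static void main(String[] args) {
        // Correct: no space between % and s.
        System.out.println(String.format("[%s]\t%s", 42L, "key"));

        try {
            String.format("[% s]", "key"); // incorrect: space flag with s
        } catch (java.util.FormatFlagsConversionMismatchException e) {
            System.out.println("threw: " + e);
        }
    }
}
```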
If your data is not on HDFS, you need to upload it to HDFS. Two options:
i) Run hdfs dfs -put on your .txt file, and once it is on HDFS you can convert it to a seq file.
ii) Take the text file as input on your HDFS client box and convert it to a SeqFile using the SequenceFile APIs, by creating a SequenceFile.Writer and appending (key, value) pairs to it.
If you don't care about the key, you can use the line number as the key and the complete line of text as the value.
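A sketch of option (ii) along those lines, assuming line number as key and line text as value (the class name and argument handling are illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Reads a local text file and appends each line to a SequenceFile,
// keyed by its line number. args[0] = input txt, args[1] = output seq.
public class TxtToSeqFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path outPath = new Path(args[1]);

        LongWritable key = new LongWritable();
        Text value = new Text();

        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(outPath),
                     SequenceFile.Writer.keyClass(LongWritable.class),
                     SequenceFile.Writer.valueClass(Text.class))) {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                key.set(lineNo++);
                value.set(line);
                writer.append(key, value);
            }
        }
    }
}
```

With the default (local) filesystem configuration this writes to the local disk; point `fs.defaultFS` at your cluster to write directly to HDFS.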
If you have Mahout installed, it has a command called seqdirectory which can do it.