How to convert a .txt file to Hadoop's sequence file format
To effectively utilise map-reduce jobs in Hadoop, I need the data to be stored in Hadoop's sequence file format. However, currently the data is only in flat .txt format. Can anyone suggest a way I can convert a .txt file to a sequence file?
So the simplest answer is just an "identity" job that has a SequenceFile output.
It looks like this in Java:
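The answer's original snippet was not preserved on this page; a minimal sketch of such an identity job, assuming the built-in `Mapper` as an identity mapper and illustrative input/output paths, might look like:

```java
// Sketch of an "identity" map-only job: reads plain text with
// TextInputFormat and writes the unchanged (offset, line) pairs
// to a SequenceFile. Paths and the class name are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TxtToSequenceFileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "txt-to-seqfile");
        job.setJarByClass(TxtToSequenceFileJob.class);

        // The base Mapper class passes every record through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0); // map-only job, no shuffle needed

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it with the input directory and a non-existent output directory as arguments; the output directory will contain the SequenceFile parts.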
It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat which creates one record for each line. In your mapper you can parse that line and use it whichever way you choose.
If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.
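For the one-line-per-record case, a sketch of such a mapper (the tab-separated "id, payload" layout and the class name are assumptions for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: assumes each input line is "id<TAB>payload"
// and re-emits the record keyed by the id field.
public class LineParsingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split into at most two fields so tabs in the payload survive.
        String[] fields = line.toString().split("\t", 2);
        if (fields.length == 2) { // skip malformed lines
            outKey.set(fields[0]);
            outValue.set(fields[1]);
            context.write(outKey, outValue);
        }
    }
}
```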
You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.) and INSERT ... SELECT from the intermediate table. You can also set options such as compression.
The MR framework will then take care of the heavy lifting for you, saving you the trouble of having to write Java code.
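A minimal HiveQL sketch of this approach (the table names, columns, delimiter, path, and compression setting are all illustrative assumptions):

```sql
-- Intermediate table holding the raw delimited text (names are hypothetical).
CREATE TABLE staging_txt (id STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hive/input/data.txt' INTO TABLE staging_txt;

-- Final table stored as a sequence file, with compressed output enabled.
SET hive.exec.compress.output=true;
CREATE TABLE data_seq (id STRING, payload STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE data_seq
SELECT id, payload FROM staging_txt;
```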
Be watchful with the %s format specifier. For example (note the space between % and s), System.out.printf("[% s]\t%s\t%s\n", writer.getLength(), key, value); will give us java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =
Instead, we should use: System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
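A self-contained demonstration of the pitfall (using String.format so it runs without a SequenceFile.Writer; the values are placeholders):

```java
// "% s" is parsed as the space flag plus the s conversion, and the
// space flag is not valid for s, so the Formatter throws.
public class FormatFlagDemo {
    public static void main(String[] args) {
        // Correct: no space between % and s.
        System.out.println(String.format("[%s]\t%s", 42L, "key"));

        try {
            String.format("[% s]", "key"); // incorrect: space flag with s
        } catch (java.util.FormatFlagsConversionMismatchException e) {
            System.out.println("threw: " + e);
        }
    }
}
```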
If your data is not on HDFS, you need to upload it to HDFS. Two options:
i) Run hdfs dfs -put on your .txt file, and once it is on HDFS you can convert it to a seq file.
ii) Take the text file as input on your HDFS client box and convert it to a SeqFile using the SequenceFile APIs, by creating a SequenceFile.Writer and appending (key, value) pairs to it.
If you don't care about the key, you can use the line number as the key and the complete line of text as the value.
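A sketch of option (ii) along those lines, assuming line number as key and line text as value (the class name and argument handling are illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Reads a local text file and appends each line to a SequenceFile,
// keyed by its line number. args[0] = input txt, args[1] = output seq.
public class TxtToSeqFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path outPath = new Path(args[1]);

        LongWritable key = new LongWritable();
        Text value = new Text();

        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(outPath),
                     SequenceFile.Writer.keyClass(LongWritable.class),
                     SequenceFile.Writer.valueClass(Text.class))) {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                key.set(lineNo++);
                value.set(line);
                writer.append(key, value);
            }
        }
    }
}
```

With the default (local) filesystem configuration this writes to the local disk; point `fs.defaultFS` at your cluster to write directly to HDFS.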
If you have Mahout installed, it has a command called seqdirectory which can do it.