How do I use Hadoop Streaming with LZO-compressed sequence files?
I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.
For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."
What do I need to do in order to process these input files with Hadoop Streaming?
I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?
I've tried using a very simple identity script as both my mapper and reducer:
#!/usr/bin/env ruby
# Identity mapper/reducer: echo each input line from Hadoop Streaming unchanged.
STDIN.each do |line|
  puts line
end
but this doesn't work.
4 Answers
LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.
I just tried this and it works...
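For reference, a minimal sketch of such a streaming invocation (the streaming jar location, the S3 paths, and the identity.rb script name are placeholders assumed here, not taken from the original answer):

# Placeholder paths; adjust the streaming jar location and S3 URIs for your setup.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
    -inputformat SequenceFileAsTextInputFormat \
    -input s3://<bucket-with-ngrams-sequence-files>/ \
    -output s3://<your-bucket>/ngrams-output/ \
    -mapper identity.rb \
    -reducer identity.rb \
    -file identity.rb

With SequenceFileAsTextInputFormat, each record's key and value are handed to the mapper as a tab-separated text line on stdin, so the identity script above simply echoes them back out.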
LZO compression has been removed from Hadoop 0.20.x onwards due to licensing issues. If you want to process LZO-compressed sequence files, the LZO native libraries have to be installed and configured on the Hadoop cluster.
Kevin's hadoop-lzo project is the current working solution I am aware of. I have tried it, and it works.
Install the lzo-devel packages on the OS (if not already installed). These packages enable LZO compression at the OS level, without which Hadoop's LZO compression won't work.
Follow the instructions in the hadoop-lzo README and compile it. After the build, you get the hadoop-lzo-lib jar and the Hadoop LZO native libraries. Make sure you compile it on the machine (or a machine of the same architecture) where your cluster is configured.
The standard Hadoop native libraries are also required; they are provided in the distribution by default for Linux. If you are using Solaris, you will also need to build Hadoop from source in order to get the standard Hadoop native libraries.
Restart the cluster once all the changes are made.
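As a rough illustration only (the authoritative property list is in the hadoop-lzo README, and values may differ by version), the core-site.xml entries that register the LZO codecs typically look like this:

<!-- Illustrative core-site.xml snippet; verify against the hadoop-lzo README for your version. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>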
You may want to look at https://github.com/kevinweil/hadoop-lzo
I had weird results using LZO, and my problem was resolved with some other codec; then things just worked. You don't need to (and maybe also shouldn't) change the -inputformat.