I can't get Hadoop to start using Amazon EC2/S3
I have created an AMI image and installed Hadoop from the Cloudera CDH2 build. I configured my core-site.xml like so:
<property>
  <name>fs.default.name</name>
  <value>s3://<BUCKET NAME>/</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value><ACCESS ID></value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value><SECRET KEY></value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop-0.20/cache/${user.name}</value>
</property>
But when I start up the Hadoop daemons, I get the following error message in the namenode log:
2010-11-03 23:45:21,680 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3://<BUCKET NAME>/ is not of scheme 'hdfs'.
at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:198)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:306)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1006)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1015)
2010-11-03 23:45:21,691 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
However, I am able to execute hadoop commands from the command line like so:
hadoop fs -put sun-javadb-common-10.5.3-0.2.i386.rpm s3://<BUCKET NAME>/
hadoop fs -ls s3://poc-jwt-ci/
Found 3 items
drwxrwxrwx - 0 1970-01-01 00:00 /
-rwxrwxrwx 1 16307 1970-01-01 00:00 /sun-javadb-common-10.5.3-0.2.i386.rpm
drwxrwxrwx - 0 1970-01-01 00:00 /var
You will notice there are / and /var folders in the bucket. I ran hadoop namenode -format when I first saw this error, then restarted all services, but I still receive the weird error: Invalid URI for NameNode address (check fs.default.name): s3://<BUCKET NAME>/ is not of scheme 'hdfs'.
I also noticed that the file system it created looks like this:
hadoop fs -ls s3://<BUCKET NAME>/var/lib/hadoop-0.20/cache/hadoop/mapred/system
Found 1 items
-rwxrwxrwx 1 4 1970-01-01 00:00 /var/lib/hadoop-0.20/cache/hadoop/mapred/system/jobtracker.info
Any ideas of what's going on?
Answers (5)
First, I suggest you just use Amazon Elastic MapReduce. There is zero configuration required on your end. EMR also has a few internal optimizations and monitoring that work to your benefit.
Second, do not use s3: as your default FS. For one, S3 is too slow to be used to store intermediate data between jobs (a typical unit of work in Hadoop is a dozen to dozens of MR jobs). It also stores the data in a 'proprietary' format (blocks etc.), so external apps can't effectively touch the data in S3.
Note that s3: in EMR is not the same as s3: in the standard Hadoop distro. The Amazon guys actually alias s3: to s3n: (s3n: is just raw/native S3 access).
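As a hedged illustration of that advice (not part of the original answer), a core-site.xml sketch could keep HDFS as the default filesystem and only supply the S3 credentials, so jobs read and write s3n:// URIs directly; <NAMENODE HOST> is a placeholder for your master node:
<!-- Minimal sketch: keep HDFS as the default FS and reach S3 via s3n:// per job. -->
<!-- <NAMENODE HOST>, <ACCESS ID> and <SECRET KEY> are placeholders. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://<NAMENODE HOST>:8020/</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value><ACCESS ID></value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value><SECRET KEY></value>
</property>
Job inputs and outputs can then point at s3n://<BUCKET NAME>/path while the NameNode still binds to an hdfs:// URI.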
I think you should not execute
bin/hadoop namenode -format
because it is used to format HDFS. In later versions, Hadoop moved these functions into a separate script called "bin/hdfs". After you set the configuration parameters in core-site.xml and the other configuration files, you can use S3 as the underlying file system directly.
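As a small sketch (assuming the s3:// credentials from the question's core-site.xml are in place; local-data.txt is just an illustrative local file), working against S3 directly needs no namenode -format and no HDFS daemons at all:
# no format step and no HDFS daemons; the shell talks to the bucket directly
hadoop fs -mkdir s3://<BUCKET NAME>/input
hadoop fs -put local-data.txt s3://<BUCKET NAME>/input/
hadoop fs -ls s3://<BUCKET NAME>/input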
You could also use Apache Whirr for this workflow, like this:
Start by downloading the latest release (0.7.0 at this time) from http://www.apache.org/dyn/closer.cgi/whirr/
Extract the archive and try to run
./bin/whirr version
You need to have Java installed for this to work. Make your Amazon AWS credentials available as environment variables:
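A sketch of what that step commonly looks like (the exact variable names are whatever your Whirr recipe references; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumed here, and the values are placeholders):
export AWS_ACCESS_KEY_ID=<ACCESS ID>        # assumed name referenced by the recipe
export AWS_SECRET_ACCESS_KEY=<SECRET KEY>   # assumed name referenced by the recipe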
Update the Hadoop EC2 config to match your needs by editing
recipes/hadoop-ec2.properties
Check the Configuration Guide for more info. Start a Hadoop cluster by running:
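As a hedged example, using the recipe file mentioned above with the Whirr CLI's launch-cluster subcommand and --config flag:
./bin/whirr launch-cluster --config recipes/hadoop-ec2.properties
When you are finished, the matching destroy-cluster subcommand with the same --config tears the cluster down.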
You can see verbose logging output by doing
tail -f whirr.log
Now you can login to your cluster and do your work.
For more explanations you should read the Quick Start Guide and the 5 minutes guide.
Disclaimer: I'm one of the committers.
Use
fs.defaultFS = s3n://awsAccessKeyId:awsSecretAccessKey@BucketName
in your /etc/hadoop/conf/core-site.xml.
Then do not start your datanode or namenode. If you have services that need your datanode and namenode, this will not work.
I did this and can access my bucket using commands like
sudo hdfs dfs -ls /
Note that if your awsSecretAccessKey contains a "/" character, you will have to URL-encode it.
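A hedged sketch of the corresponding core-site.xml entry (all values are placeholders; any "/" in the secret key is written as %2F):
<property>
  <name>fs.defaultFS</name>
  <!-- placeholders; URL-encode "/" in the secret key as %2F -->
  <value>s3n://<ACCESS ID>:<SECRET KEY>@<BUCKET NAME></value>
</property>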
Use s3n instead of s3.
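For example (a minimal illustration, with the same placeholder bucket as the question), the default FS value would become:
<property>
  <name>fs.default.name</name>
  <!-- s3n: (native) instead of the block-based s3: scheme -->
  <value>s3n://<BUCKET NAME>/</value>
</property>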