Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some Hadoop MapReduce jobs, but I'm struggling to successfully create a cluster. I have downloaded the EC2 files and have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I try to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the command "hadoop-ec2 launch-cluster name n" in the terminal. The master node launches successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.

Also, some of my jobs will require altering Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on the Amazon machines, with this file accessible and alterable?
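
To give a concrete illustration of the kind of change I mean (the property and value below are just an example, not taken from my actual jobs), it would be an override in mapred-site.xml along these lines:

    <!-- illustrative override only; mapred.child.java.opts is just an example property -->
    <configuration>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx1024m</value>
      </property>
    </configuration>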

Thanks

Comments (1)

倚栏听风 2024-12-21 04:24:34

Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.

That's easier than creating your own cluster manually.

But by default, once the job flow is finished, it shuts the cluster down, leaving you with the outputs on S3. If all you need is to do some crunching, this may be the way to go.
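
As a minimal sketch of what that looks like with the elastic-mapreduce command-line client (assuming you have it installed and configured with your AWS credentials; the bucket, JAR and arguments below are placeholders, and exact flag names can differ between client versions):

    # Start a 5-instance job flow (1 master + 4 slaves), run a custom JAR, then shut down.
    # Add --alive if you want the cluster to keep running after the steps finish.
    elastic-mapreduce --create \
      --name "my-mapreduce-job" \
      --num-instances 5 \
      --instance-type m1.small \
      --jar s3://my-bucket/jars/my-job.jar \
      --arg s3://my-bucket/input \
      --arg s3://my-bucket/output

The output ends up wherever your JAR writes it on S3, and the instances are terminated automatically once the job flow completes.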

If you need the HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In that case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.

Altering the Hadoop configuration on the nodes it starts is possible using Elastic MapReduce Bootstrap Actions:

Q: How do I configure Hadoop settings for my job flow?

The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
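
As a rough sketch of the Configure Hadoop bootstrap action mentioned above, overriding a mapred-site.xml value at cluster start-up would look roughly like this; treat the exact flag syntax as an assumption on my part, since it varies between AMI and client versions:

    # Launch a job flow whose mapred-site.xml gets a custom value applied on every node
    # via the predefined Configure Hadoop bootstrap action (path and -m flag from memory).
    elastic-mapreduce --create --alive \
      --name "tuned-cluster" \
      --num-instances 5 \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "-m,mapred.child.java.opts=-Xmx1024m"

This way you never have to edit mapred-site.xml on the nodes by hand; the bootstrap action rewrites it before the Hadoop daemons start.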

Regarding the way you are starting the cluster, please clarify:

If I try to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the command "hadoop-ec2 launch-cluster name n" in the terminal. The master node launches successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.

How exactly are you trying to start it? What AMIs exactly are you using?
