Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some Hadoop MapReduce jobs, but I'm struggling to successfully create a cluster. I have downloaded the EC2 files and have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I try to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the command "hadoop-ec2 launch-cluster name n" in the terminal. The master node launches successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.

Also, some of my jobs will require altering Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on the Amazon machines, with this file accessible and alterable?
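
To give a concrete illustration of the kind of change I mean (the property and value below are just an example, not taken from my actual jobs), it would be an override in mapred-site.xml along these lines:

    <!-- illustrative override only; mapred.child.java.opts is just an example property -->
    <configuration>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx1024m</value>
      </property>
    </configuration>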

Thanks

Comments (1)

倚栏听风 2024-12-21 04:24:34

Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.

That's easier than creating your own cluster manually.

But by default, once the job flow is finished, it shuts the cluster down, leaving you with the outputs on S3. If all you need is to do some crunching, this may be the way to go.
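
As a minimal sketch of what that looks like with the elastic-mapreduce command-line client (assuming you have it installed and configured with your AWS credentials; the bucket, JAR and arguments below are placeholders, and exact flag names can differ between client versions):

    # Start a 5-instance job flow (1 master + 4 slaves), run a custom JAR, then shut down.
    # Add --alive if you want the cluster to keep running after the steps finish.
    elastic-mapreduce --create \
      --name "my-mapreduce-job" \
      --num-instances 5 \
      --instance-type m1.small \
      --jar s3://my-bucket/jars/my-job.jar \
      --arg s3://my-bucket/input \
      --arg s3://my-bucket/output

The output ends up wherever your JAR writes it on S3, and the instances are terminated automatically once the job flow completes.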

If you need the HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In that case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.

Altering the Hadoop configuration on the nodes it starts is possible using Elastic MapReduce Bootstrap Actions:

Q: How do I configure Hadoop settings for my job flow?

The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
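
As a rough sketch of the Configure Hadoop bootstrap action mentioned above, overriding a mapred-site.xml value at cluster start-up would look roughly like this; treat the exact flag syntax as an assumption on my part, since it varies between AMI and client versions:

    # Launch a job flow whose mapred-site.xml gets a custom value applied on every node
    # via the predefined Configure Hadoop bootstrap action (path and -m flag from memory).
    elastic-mapreduce --create --alive \
      --name "tuned-cluster" \
      --num-instances 5 \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "-m,mapred.child.java.opts=-Xmx1024m"

This way you never have to edit mapred-site.xml on the nodes by hand; the bootstrap action rewrites it before the Hadoop daemons start.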

Regarding the way you are starting the cluster, please clarify:

If I try to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the command "hadoop-ec2 launch-cluster name n" in the terminal. The master node launches successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.

How exactly are you trying to start it? What AMIs exactly are you using?
