Run a Hadoop Cluster on EC2 the easy way using Apache Whirr

To set up the cluster, you need two things:
1.    An AWS account
2.    A local machine running Ubuntu (mine was running Lucid)
The following steps should do the trick:
Step 1 – Add the JDK repository to apt and install the JDK (replace lucid with your Ubuntu release codename; check it by running lsb_release -c in the terminal):
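A minimal sketch of this step, assuming the Sun JDK 6 package (sun-java6-jdk) from the Canonical partner repository, which CDH3-era setups commonly used. The repository line and package name are assumptions, and run is a hypothetical helper that only prints each command unless APPLY=1 is set:

```shell
# Hypothetical helper: print commands by default, execute them only when APPLY=1.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

# Detect the Ubuntu codename; fall back to "lucid" if lsb_release is unavailable.
CODENAME="$(lsb_release -cs 2>/dev/null || echo lucid)"

# Assumed repository line for the Sun JDK on Ubuntu releases of that era.
PARTNER_REPO="deb http://archive.canonical.com/ ${CODENAME} partner"

run sudo add-apt-repository "$PARTNER_REPO"
run sudo apt-get update
run sudo apt-get install sun-java6-jdk
```

Run it once as-is to review the commands, then again with APPLY=1 to execute them.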

Step 2 – Create a file named cloudera.list in /etc/apt/sources.list.d/ and paste the following content in it (again, replace lucid with your version):
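A sketch of what this file can look like, assuming the CDH3 Debian repository layout at archive.cloudera.com (the URL and the "-cdh3 contrib" suffix are assumptions based on the CDH3 packaging, so check them against the installation guide). The file is staged in a temporary directory first so it can be reviewed before it is installed:

```shell
# Release codename; replace "lucid" with your own (see Step 1).
CODENAME="${CODENAME:-lucid}"

# Stage the file locally so it can be inspected before installing it.
LIST_FILE="${TMPDIR:-/tmp}/cloudera.list"
cat > "$LIST_FILE" <<EOF
deb http://archive.cloudera.com/debian ${CODENAME}-cdh3 contrib
deb-src http://archive.cloudera.com/debian ${CODENAME}-cdh3 contrib
EOF

cat "$LIST_FILE"
# Then install it (needs root):
#   sudo cp "$LIST_FILE" /etc/apt/sources.list.d/cloudera.list
```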

Step 3 – Add the Cloudera public key to apt, update the package index, and install Hadoop and Whirr:
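Roughly, this step could look like the sketch below. The key URL and the package names (hadoop-0.20 and whirr) are assumptions based on the CDH3 packaging, and run is a hypothetical helper that prints each command unless APPLY=1 is set:

```shell
# Hypothetical helper: print commands by default, execute them only when APPLY=1.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

KEY_URL="http://archive.cloudera.com/debian/archive.key"   # assumed key location
KEY_FILE="${TMPDIR:-/tmp}/cloudera-archive.key"

# Download the signing key, register it with apt, then install the packages.
run curl -s -o "$KEY_FILE" "$KEY_URL"
run sudo apt-key add "$KEY_FILE"
run sudo apt-get update
run sudo apt-get install hadoop-0.20 whirr
```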

Step 4 – Create a file in your $HOME folder and paste the following content in it.
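A sketch of such a file, written from the shell. The file name hadoop.properties and the cluster name myhadoopcluster are hypothetical, and the property values are assumptions based on early Whirr releases (newer releases use longer role names such as hadoop-namenode+hadoop-jobtracker, so check the documentation for your version):

```shell
CONFIG="${CONFIG:-$HOME/hadoop.properties}"   # hypothetical file name

# The quoted 'EOF' keeps ${sys:user.home} literal so that Whirr, not the shell, expands it.
cat > "$CONFIG" <<'EOF'
whirr.service-name=hadoop
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt
whirr.provider=ec2
whirr.identity=[AWS ID]
whirr.credential=[AWS KEY]
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
EOF

# The file will hold AWS credentials, so keep it readable by you only.
chmod 600 "$CONFIG"
```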

Step 5 – Replace [AWS ID] and [AWS KEY] with your own AWS Access Key ID and Secret Access Key. You can find them in the Access Credentials section of your AWS account. Note the third line: it defines the nodes that will run in your cluster. This cluster runs one node as a combined namenode (nn) and jobtracker (jt), and another node as a combined datanode (dn) and tasktracker (tt).

Step 6 – Generate an RSA keypair on your machine. Do not enter a passphrase.
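One way to do this with OpenSSH, assuming the default key location ~/.ssh/id_rsa. The sketch skips generation if a key already exists, so an existing key is never overwritten:

```shell
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"   # default OpenSSH key location

mkdir -p "$(dirname "$KEY_FILE")"
chmod 700 "$(dirname "$KEY_FILE")"

if [ -f "$KEY_FILE" ]; then
  echo "Reusing existing key at $KEY_FILE"
elif command -v ssh-keygen >/dev/null 2>&1; then
  # -N '' sets an empty passphrase; -f sets the output file.
  ssh-keygen -q -t rsa -N '' -f "$KEY_FILE"
else
  echo "ssh-keygen not found; install the OpenSSH client first" >&2
fi
```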

Step 7 – Launch the cluster! Navigate to your home directory and run:
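A sketch of the launch command, assuming the configuration file from Step 4 was saved as ~/hadoop.properties (a hypothetical name). The guard only avoids a confusing error when Whirr is not installed yet:

```shell
CONFIG="$HOME/hadoop.properties"   # hypothetical file name from Step 4

if command -v whirr >/dev/null 2>&1; then
  # Starts the EC2 instances and configures Hadoop on them.
  whirr launch-cluster --config "$CONFIG"
else
  echo "whirr is not on PATH; finish Step 3 first" >&2
fi
```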

This step will take some time as Whirr creates instances and configures Hadoop on them.

Step 8 – Run a Whirr proxy. The proxy is required for secure communication between the master node of the cluster and the client machine (your Ubuntu machine). Run the following command in a new terminal window:
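Whirr writes a proxy script into ~/.whirr/ under the cluster name once the launch completes. A sketch, assuming the cluster is named myhadoopcluster (a hypothetical name that must match whirr.cluster-name in your configuration file):

```shell
CLUSTER_NAME="myhadoopcluster"   # hypothetical; must match whirr.cluster-name
PROXY_SCRIPT="$HOME/.whirr/$CLUSTER_NAME/hadoop-proxy.sh"

if [ -f "$PROXY_SCRIPT" ]; then
  # Runs in the foreground; keep this terminal open while you use the cluster.
  sh "$PROXY_SCRIPT"
else
  echo "No proxy script at $PROXY_SCRIPT; has the cluster finished launching?" >&2
fi
```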

Step 9 – Configure the local Hadoop installation to use Whirr for running jobs.
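A sketch of one way to do this on a CDH3-style install: copy the empty configuration directory, drop in the hadoop-site.xml that Whirr generated, and make it the active configuration through update-alternatives. The /etc/hadoop-0.20 paths, the hadoop-0.20-conf alternative name, and the cluster name are assumptions, and run is a hypothetical helper that prints each command unless APPLY=1 is set:

```shell
# Hypothetical helper: print commands by default, execute them only when APPLY=1.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

CLUSTER_NAME="myhadoopcluster"   # hypothetical; must match whirr.cluster-name
CONF_DIR="/etc/hadoop-0.20"      # assumed CDH3 configuration root

# Start from the empty configuration and remove any existing site files.
run sudo cp -r "$CONF_DIR/conf.empty" "$CONF_DIR/conf.whirr"
run sudo rm -f "$CONF_DIR/conf.whirr/"*-site.xml

# Use the site file that Whirr generated for this cluster.
run sudo cp "$HOME/.whirr/$CLUSTER_NAME/hadoop-site.xml" "$CONF_DIR/conf.whirr"

# Register the new directory as the active Hadoop configuration and verify it.
run sudo update-alternatives --install "$CONF_DIR/conf" hadoop-0.20-conf "$CONF_DIR/conf.whirr" 50
run update-alternatives --display hadoop-0.20-conf
```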

Step 10 – Add $HADOOP_HOME to your ~/.bashrc file by placing the following line at the end:
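A sketch that appends the line only if it is not already present, assuming Hadoop was installed under /usr/lib/hadoop (the usual location for the CDH3 packages, but verify it on your machine):

```shell
BASHRC="${BASHRC:-$HOME/.bashrc}"
LINE='export HADOOP_HOME=/usr/lib/hadoop'   # assumed install location

touch "$BASHRC"
# -x matches the whole line and -F treats it as a fixed string, so reruns do not duplicate it.
grep -qxF "$LINE" "$BASHRC" || echo "$LINE" >> "$BASHRC"
```

Open a new terminal, or run source ~/.bashrc, for the variable to take effect.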

Step 11 – Test run a MapReduce job:
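For instance, the bundled wordcount example can be run against a small input file. This is a sketch: the sample file and the input and output directory names are hypothetical, and the examples jar is looked up by pattern because its exact name varies between Hadoop releases:

```shell
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"

# Create a tiny local input file (hypothetical sample data).
SAMPLE="${TMPDIR:-/tmp}/whirr-sample.txt"
printf 'hello whirr\nhello hadoop\n' > "$SAMPLE"

# Pick the first examples jar found under HADOOP_HOME.
EXAMPLES_JAR=""
for jar in "$HADOOP_HOME"/hadoop-*examples*.jar; do
  if [ -f "$jar" ]; then EXAMPLES_JAR="$jar"; break; fi
done

if command -v hadoop >/dev/null 2>&1 && [ -n "$EXAMPLES_JAR" ]; then
  hadoop fs -mkdir input
  hadoop fs -put "$SAMPLE" input
  hadoop jar "$EXAMPLES_JAR" wordcount input output
  # Quoted so that Hadoop, not the local shell, expands the pattern.
  hadoop fs -cat 'output/part-*' | head
else
  echo "hadoop or the examples jar was not found under $HADOOP_HOME" >&2
fi
```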

Step 12 (Optional) – Destroy the cluster:
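A sketch of the teardown, again assuming the configuration file was saved as ~/hadoop.properties (a hypothetical name). Destroying the cluster terminates the EC2 instances, so copy any job output you want to keep first:

```shell
CONFIG="$HOME/hadoop.properties"   # hypothetical file name from Step 4

if command -v whirr >/dev/null 2>&1; then
  # Terminates every instance that belongs to the cluster named in the config.
  whirr destroy-cluster --config "$CONFIG"
else
  echo "whirr is not on PATH; nothing was destroyed" >&2
fi
```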

Note: This tutorial was prepared using material from the CDH3 Installation Guide.
