Import Your Own RSA SSH Key into Amazon EC2

Recently, I needed to import my own RSA SSH key into EC2. Here is a quick way to import it into all regions.
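
A minimal sketch, assuming the EC2 API tools are on your PATH and your EC2_CERT/EC2_PRIVATE_KEY environment variables are set; the key name and public key path are placeholders:

    # Import the same public key into every region returned by ec2-describe-regions
    for region in $(ec2-describe-regions | awk '{print $2}'); do
        ec2-import-keypair my-key --public-key-file ~/.ssh/id_rsa.pub --region "$region"
    done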

NB: You need the EC2 API tools set up before you can run this. You will also need to have set up an X.509 certificate pair.

You can read more about the ec2-import-keypair command in the EC2 documentation.

Assign multiple security groups on Amazon EC2 Run Instances (ec2-run-instances)

Recently, I needed to deploy new servers with multiple defined security groups on AWS EC2 using CLI.
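
A hedged example, assuming the EC2 API tools; the AMI ID, key name, and group names are placeholders. The -g flag can be repeated once per security group:

    # Launch one m1.small instance with three security groups attached at launch time
    ec2-run-instances ami-12345678 -n 1 -t m1.small -k my-key -g web -g ssh -g monitoring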

Note: Security groups can only be assigned at instance launch unless you are using VPC.

Logging the client IP behind Amazon ELB with Apache

When you place your Apache web server behind an Amazon Elastic Load Balancer, Apache sees every request as coming from the ELB's IP address rather than the client's.

Therefore, if you wish to do anything with the real client IP address, such as logging or whitelisting, you need to make use of the X-Forwarded-For HTTP header that Amazon ELB adds to each request, which contains the IP address of the original client.

Solution for logging the true client IP

Before:
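
(Assuming the stock "combined" LogFormat, which logs the remote host %h, i.e. the ELB's address:)

    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined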

After:
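
(A sketch of the change: replace %h with the X-Forwarded-For request header so the original client IP is logged instead:)

    LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined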

The one downside is that depending on how ELB treats X-Forwarded-For, it may allow clients to spoof their source IP.

Hopefully this helps out anyone experiencing this issue.

sudo: unable to resolve host ubuntu

Sometimes after changing Elastic IP settings or stopping/starting instances on EC2, I get an irritating error like this when I execute a command with sudo:

sudo: unable to resolve host domU-12-34-ab-cd-56-78

The fix is to look up the instance's private DNS name (via ec2-describe-instances or the AWS console UI) and update the hostname on the instance with the first segment of that DNS name (which looks something like ip-12-34-56-78 or domU-12-34-ab-cd-56-78). On Ubuntu, this is what you need to do (assuming ip-12-34-56-78 is the new hostname):
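
Something along these lines, using ip-12-34-56-78 from the example:

    sudo hostname ip-12-34-56-78
    echo "ip-12-34-56-78" | sudo tee /etc/hostname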

The first line sets the hostname until you reboot, and the second line configures the hostname to use once you do reboot.

Install s3cmd/s3tools on Debian/Ubuntu

1. Register for Amazon AWS (yes, it asks for a credit card)

2. Install s3cmd (the commands below are for Debian/Ubuntu, but you can find how-tos for other Linux distributions at s3tools.org/repositories; see the sketch after step 5)

3. Get your access key and secret key from your AWS account's Security Credentials page

4. Configure s3cmd to work with your account

5. Make a bucket (the name must be globally unique; s3cmd will tell you if it's already taken)
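
A combined sketch of the commands for steps 2, 4 and 5 on Debian/Ubuntu; the bucket name is a placeholder and must be globally unique:

    sudo apt-get install s3cmd      # step 2: from the distro repo (or the s3tools.org repo for newer builds)
    s3cmd --configure               # step 4: prompts for your access key and secret key
    s3cmd mb s3://my-unique-bucket  # step 5: make a bucket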

Auto-start Sphinx (searchd) after reboot on Linux

By default, after you install and configure Sphinx, you will find that search stops working once your OS restarts. That is because searchd is not set up to start automatically. The following will solve that problem.

Create file /etc/init.d/searchd.

Copy the following into searchd.
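
A minimal sketch of such an init script; the searchd and sphinx.conf paths are assumptions, so adjust them to match your install:

    #!/bin/bash
    # /etc/init.d/searchd - start/stop the Sphinx search daemon
    SEARCHD=/usr/local/bin/searchd
    CONFIG=/usr/local/etc/sphinx.conf

    case "$1" in
      start)
        $SEARCHD --config $CONFIG
        ;;
      stop)
        $SEARCHD --config $CONFIG --stop
        ;;
      restart)
        $0 stop
        $0 start
        ;;
      *)
        echo "Usage: $0 {start|stop|restart}"
        exit 1
        ;;
    esac
    exit 0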

Add execute permission to the file.
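
For example:

    sudo chmod +x /etc/init.d/searchd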

Register it to start automatically.
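
On Debian/Ubuntu this is typically done with update-rc.d (on RHEL/CentOS, chkconfig serves the same purpose):

    sudo update-rc.d searchd defaults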

Simple Shell Script to Send Email on Ubuntu

We run a local SVN server that syncs to Amazon S3 + an EC2 instance. Syncing locally to EC2 can take anywhere from 10-20 minutes, and instead of waiting for the script to finish I decided to write a quick little email notify script. Enjoy.
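
A sketch of what such a script can look like, assuming the mail command (from mailutils or bsd-mailx) is installed; the sync command and recipient address are placeholders:

    #!/bin/bash
    # Run the sync, then send a notification email when it finishes.
    RECIPIENT="you@example.com"          # placeholder address
    START=$(date)

    /usr/local/bin/sync-to-ec2.sh        # placeholder for the actual sync command

    echo "Sync started at $START and finished at $(date)" \
        | mail -s "EC2 sync finished" "$RECIPIENT"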

How to copy /etc/hosts to all machines

I recently installed Hadoop CDH4 on a new 10-node test cluster, and instead of manually entering all the hosts in /etc/hosts I wrote a quick command to copy the current hosts file to all machines.
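
A sketch of the idea, assuming the nodes are reachable as node1 through node10 and you can SSH in as root (adjust the names and user to your cluster):

    # Push the local /etc/hosts to every node in the cluster
    for i in $(seq 1 10); do
        scp /etc/hosts root@node$i:/etc/hosts
    done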

Hopefully this saves you some time!

Distributed Apache Flume Setup With an HDFS Sink

I have recently spent a few days getting up to speed with Flume, Cloudera's distributed log offering. If you haven't seen it and deal with lots of logs, you are definitely missing out on a fantastic project. I'm not going to spend time talking about it, because you can read more about it in the user guide or in the Quora Flume topic in ways that are better than I can describe it. But I will tell you about my experience setting up Flume in a distributed environment to sync logs to an HDFS sink.

Context
I have 3 kinds of servers all running Ubuntu 10.04 locally:

hadoop-agent-1: This is the agent which is producing all the logs
hadoop-collector-1: This is the collector which is aggregating all the logs (from hadoop-agent-1, agent-2, agent-3, etc)
hadoop-master-1: This is the flume master node which is sending out all the commands

To add the CDH3 repository:

Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:
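
Something like the following, assuming Ubuntu Lucid (substitute your release codename as described below):

    deb http://archive.cloudera.com/debian lucid-cdh3 contrib
    deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib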

where the first part of the -cdh3 component (for example, lucid in lucid-cdh3) is the codename of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 for Ubuntu Lucid, use lucid-cdh3 in the deb line above.

(To install a different version of CDH on a Debian system, specify the version number you want in the -cdh3 section of the deb command. For example, to install CDH3 Update 0 for Ubuntu Maverick, use maverick-cdh3u0 in the command above.)

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
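
Per Cloudera's CDH3 instructions, something like:

    curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -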

This key enables you to verify that you are downloading genuine packages.

Initial Setup
On both hadoop-agent-1 and hadoop-collector-1, you’ll have to install flume-node (flume-node contains the files necessary to run the agent or the collector).
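
For example:

    sudo apt-get update
    sudo apt-get install flume-node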

On hadoop-master-1:
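
Install the master package (flume-master in the CDH3 packaging):

    sudo apt-get install flume-master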

First, let's jump onto the agent and set it up. Tune the hadoop-master-1 and hadoop-collector-1 values appropriately, and change your /etc/flume/conf/flume-site.xml to look like this:
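
A sketch of the agent-side config, assuming the property names from the stock Flume 0.9.x flume-conf.xml (flume.master.servers, flume.collector.event.host, flume.collector.port) and port 35853 for the collector:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>flume.master.servers</name>
        <value>hadoop-master-1</value>
      </property>
      <property>
        <name>flume.collector.event.host</name>
        <value>hadoop-collector-1</value>
      </property>
      <property>
        <name>flume.collector.port</name>
        <value>35853</value>
      </property>
    </configuration>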

Now on to the collector. Same file, different config.
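
A sketch along the same lines; the collector mainly needs to know where the master is, and the collector port here matches the collectorSource used below:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>flume.master.servers</name>
        <value>hadoop-master-1</value>
      </property>
      <property>
        <name>flume.collector.port</name>
        <value>35853</value>
      </property>
    </configuration>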

Web Based Setup

I chose to do the individual machine setup via the master web interface. You can get to it by pointing your web browser at http://hadoop-master-1:35871/ (replace hadoop-master-1 with the public/private DNS name or IP of your Flume master, or set up /etc/hosts with a hostname). Ensure that the port is accessible from the outside through your security settings. At this point, it was easiest for me to ensure all hosts running Flume could talk to all ports on all other hosts running Flume. You can certainly lock this down to the individual ports for security once everything is up and running.

At this point, you should go to hadoop-agent-1 and hadoop-collector-1 and run /etc/init.d/flume-node start on each. If everything goes well, the master (whose IP is specified in their configs) should be notified of their existence. Now you can configure them from the web. Click on the config link and then fill in the text fields with the following values:

Agent Node: hadoop-agent-1
Source: tailDir("/var/logs/apache2/",".*.log")
Sink: agentBESink("hadoop-collector-1",35853)
Note: I chose to use tailDir since I will control rotating the logs on my own. I am also using agentBESink because I am ok with losing log lines if the case arises.

Now click Submit Query and go back to the config page to set up the collector:

Agent Node: hadoop-collector-1
Source: collectorSource(35853)
Sink: collectorSink("hdfs://hadoop-master-1:8020/flume/logs/%Y/%m/%d/%H00","server")
This is going to tell the collector that we are sinking to HDFS with an initial folder of 'flume'. It will then log to sub-folders with "flume/logs/YYYY/MM/DD/HH00" (e.g. 2011/02/03/1300/server-.log). Now click Submit Query and go to the 'master' page, and you should see 2 commands listed as "SUCCEEDED" in the command history. If they have not succeeded, ensure a few things have been done (there are probably more, but this is a handy start):

Always use double quotes (") since single quotes (') aren't interpreted correctly. UPDATE: Single quotes are interpreted correctly; they are just intentionally not accepted (thanks jmhsieh).
In your regex, use something like ".*\\.log" since the '.' is part of the regex.
In your regex, ensure that your backslashes are properly escaped: "foo\\bar" is the correct way to match "foo\bar".

Additionally, there are also tables of Node Status and Node Configuration. These should match up with what you think you configured.

At this point everything should work. Admittedly, I had a lot of trouble getting to this point, but with the help of the Cloudera folks and the users in #flume on irc.freenode.net, I was able to get things going. The logs sadly aren't too helpful here in most cases (but look anyway, because they might provide you with more info than they provided for me). If I missed anything in this post or there is something else I am unaware of, then let me know.